Abstract This article presents a mathematical framework
to simultaneously tackle the problems of 3D reconstruction,
pose estimation and object classification,
from a single 2D image. In sharp contrast with state 
of the art methods that rely primarily on 2D infor- 
mation and solve each of these three problems sepa- 
rately or iteratively, we propose a mathematical frame- 
work that incorporates prior "knowledge" about the 3D 
shapes of different object classes and solves these problems
jointly and simultaneously, using a hypothesize-and-bound
(H&B) algorithm [14].

In the proposed H&B algorithm one hypothesis is 
defined for each possible pair [object class, object pose], 
and the algorithm selects the hypothesis H that maximizes
a function L(H) encoding how well each hypothesis
"explains" the input image. To find this maximum
efficiently, the function L(H) is not evaluated
exactly for each hypothesis H, but rather upper and
lower bounds for it are computed at a much lower cost.
In order to obtain bounds for L(H) that are tight yet
inexpensive to compute, we extend the theory of shapes
described in [14] to handle projections of shapes. This
extension allows us to define a probabilistic relationship 
between the prior knowledge given in 3D and the 2D 
input image. This relationship is derived from first prin- 
ciples and is proven to be the only relationship having 
the properties that we intuitively expect from a "pro- 
jection." 

In addition to the efficiency and optimality charac- 
teristics of H&B algorithms, the proposed framework 
has the desirable property of integrating information
in the 2D image with information in the 3D prior to
estimate the optimal reconstruction. While this article 
focuses primarily on the problem mentioned above, we 
believe that the theory presented herein has multiple 
other potential applications. 

Keywords 3D reconstruction • pose estimation • ob- 
ject classification • shapes • shape priors • hypothesize- 
and-verify • coarse-to-fine • probabilistic inference • 
graphical models • image understanding 



1 Introduction 

It is in general easy for humans to "perceive" three di- 
mensional (3D) objects, even when presented with a 
two dimensional (2D) image alone. This situation in 
which one does not rely on binocular (stereo) vision to 
"perceive" 3D commonly arises when one is closing an 
eye, looking at a picture or screen, or simply processing 
objects in the monocular parts of the visual field. The 
ability to "perceive" the world in 3D is essential to in- 
teract with the environment and to "understand" the 
observed images, which arguably is only achieved when 
the underlying 3D structure of a scene is understood. 
By trying to design machines that replicate this ability, 
we come to appreciate the tremendous complexity of 
this problem and the marvelous proficiency of our own 
visual system. 



1.1 In defense of 3D "knowledge" 

This problem, namely 3D reconstruction from a single
2D image, is inherently ambiguous when no other con- 
straints are imposed. Humans presumably solve it by 
relying on knowledge about the principles underlying 
the creation of images (exploited in §3.4), knowledge
about the specific object classes involved (see below), 
and knowledge about the laws of nature that govern the 
physical interactions between objects. 

While the knowledge about the object classes can 
take many forms, in this work we focus on 3D shape
priors (described in detail in §3.3). This choice of a 3D
representation of knowledge, as opposed to a 2D repre- 
sentation, has several important consequences. First, it 
results in a framework that can handle images of an object
from any viewpoint, rather than only from the viewpoints
included in the training dataset. In other words,
it makes the framework viewpoint independent. Second, 
it allows us to store the knowledge "efficiently," because 
there is no need to store specific information for each 
viewpoint. Instead, the same (common) "knowledge" is 
used for any viewpoint. Third, because this knowledge 
is common to all the viewpoints, the number of training 
examples required to acquire it is reduced. And fourth, 
the fact that the prior knowledge is represented in 3D 
allows us to easily impose constraints on the physical
interactions between objects that would be very hard
to impose otherwise (e.g., that an object is resting on
a supporting plane, not "flying" above it). 

1.2 In defense of synergy 

In order to incorporate prior "knowledge" about an ob- 
ject in its 3D reconstruction, in particular prior knowl- 
edge about its 3D shape, it is necessary to classify the 
object and estimate its location/pose. On the one hand, 
classifying the object is essential to be able to consider 
only the specific prior knowledge about the class of the 
particular object in the image, among the possibly vast 
amount of available general knowledge. This specific 
knowledge is particularly important to reconstruct the 
parts of the object that are not visible in the image 
(e.g., those that are occluded), since in this case the
prior knowledge might be the only source of informa- 
tion. On the other hand, estimating the pose of the 
object is necessary to incorporate the prior geometric 
knowledge in the right spatial locations. For example, 
knowing that an object is a 'mug' tells us that it is expected
to have a handle somewhere; knowing also the
pose of the object tells us where the handle is supposed
to be. Similarly, to classify the object and estimate its
pose, it is helpful to rely on the information in its 3D 
reconstruction. 

This suggests that all these problems (3D recon- 
struction, object classification and pose estimation) are 
intimately related, and hence it might be advantageous 
to solve them simultaneously rather than in any particular
order. For this reason, in this paper we simultaneously
address the problems of 3D reconstruction,
object classification and pose estimation, from a single 
2D image. For simplicity, we will restrict our attention 
to cases in which the input image is known to contain 
a single object from one of several known classes. 



1.3 In defense of shape 

Despite the fact that appearance is often a good indica- 
tor of class identity (see the large body of work relying 
on local features for recognition [12]), there are cases
in which shape might be a more informative cue. For 
example, there are cases in which the appearance of the 
objects in a class is too variable and/or unrelated to the 
class identity to be of any use, while their (3D) shape 
is well preserved (e.g., consider the class 'mugs').

Moreover, shapes can often be extracted very reliably
from videos (e.g., for the important class of fixed
camera surveillance videos). For example, using existing
background modelling techniques (e.g., [9,4]), it is
possible to compute a foreground probability image encoding
the shape of the foreground object in the image.
This image contains, at each pixel, the probability that 
the foreground object is seen at that pixel. Thus this im- 
age contains only the "shape information" while all the 
"appearance information" has been "filtered out." This 
foreground probability image, which is one of the in- 
puts to our system, can be computed in different ways, 
and our algorithm is independent of the particular al- 
gorithm used to compute it. 

Therefore, though appearance and shape cues are in 
general complementary, for concreteness, we only con- 
sider shape cues. Appearance is only considered in a 
preprocessing step (i.e., not as part of our framework) 
to compute the foreground probability images. Hence, 
since this framework does not rely on detecting local 
features, it can handle featureless objects, or comple- 
ment other approaches that do use local features. 



1.4 The general inference framework 

In order to solve the problems mentioned above simul- 
taneously, while exploiting shape cues and prior 3D 
knowledge about the object classes, we define a proba- 
bilistic graphical model encoding the relationships 
among the variables: class K, pose T, input image f,
and 3D reconstruction v (described in detail in §3).
Because of the large number of variables in this graphical
model (the dimensions of f and v are very high), and
due to the existence of a huge number of loops among 
them, standard inference methods are either very inef- 
ficient or not guaranteed to find the optimal solution. 
For this reason we solve our problem using the hypothesize-and-verify
paradigm. In this paradigm one hypothesis
H is defined for every possible "state of the
world," and the goal is to select the hypothesis that 
best "explains" the input image. In other words, the 
goal is to select the hypothesis that solves 

\[ \hat{H} = \arg\max_{H \in \mathbb{H}} L(H), \tag{1} \]

where $\mathbb{H}$ is the set of all possible hypotheses, referred
to as the hypothesis space, and L(H) is a function, referred
to as the evidence, that quantifies how well each
hypothesis "explains" the input (better hypotheses produce
higher values). This evidence is derived from the
system's joint probability, which is obtained from the 
graphical model mentioned above. 

In the specific problem addressed in this article the 
hypothesis space $\mathbb{H}$ contains every hypothesis $H_{ij}$ defined
by every possible object class $K_i$, and by every
possible object pose $T_j$ (i.e., $H_{ij} = (K_i, T_j)$). By selecting
the hypothesis $H_{ij}$ that solves (1), the hypothesize-and-verify
approach simultaneously estimates the class
$K_i$ and the pose $T_j$ of the object in the image. As we
shall later see, the 3D reconstruction v is estimated during
the computation of the evidence. Since the number
of hypotheses in the set $\mathbb{H}$ is potentially very large, it is
essential to evaluate L(H) very efficiently. For this purpose
we introduced in [14] a class of algorithms to efficiently
implement the hypothesize-and-verify paradigm.
This class of algorithms, known as hypothesize-and-bound,
is described next.

1.5 Hypothesize-and-bound algorithms 

Hypothesize-and-bound (H&B) algorithms have two
parts. The first part consists of a bounding mechanism
(BM) to compute lower and upper bounds, $\underline{L}(H)$ and
$\overline{L}(H)$, respectively, for the evidence L(H) of a hypothesis
H. These bounds are in general much cheaper to
compute than the evidence itself, and are often enough
to discard many hypotheses (note that a hypothesis $H_1$
can be safely discarded if $\overline{L}(H_1) < \underline{L}(H_2)$ for some
other hypothesis $H_2$). On the other hand, these bounds
are not as "precise" as the evidence itself, in the sense
that they only define an interval $[\underline{L}(H), \overline{L}(H)]$ where
the evidence for a hypothesis is guaranteed to be. Nevertheless,
the width of this interval (or margin) can be
made as small as desired by investing additional computational
cycles into the refinement of the bounds. In
other words, given a number of computational cycles to
be spent on a hypothesis, the BM returns an interval
on the real line where the evidence for the hypothesis
is guaranteed to lie. If additional computational cycles
are later allocated to the hypothesis, the BM can
efficiently refine the bounds defining this interval.

The second part of an H&B algorithm is a focus of 
attention mechanism (FoAM) to sensibly and dynami- 
cally allocate the available computational resources 
among the different hypotheses whose bounds are to 
be refined. Initially the FoAM calls the BM to compute
rough and cheap bounds for each hypothesis. Then, during
each iteration, the FoAM selects one hypothesis and
calls the BM to refine its bounds. This process continues
until either a hypothesis is proved optimal, or a group 
of hypotheses cannot be further refined or discarded 
(these hypotheses are said to be indistinguishable given 
the current input). Such a hypothesis, or group of hy- 
potheses, maximizes the evidence regardless of the ex- 
act values of all the evidences (which do not need to be 
computed). Interestingly, the total number of computa- 
tional cycles spent depends on the order in which the 
bounds are refined. Thus this order is carefully chosen 
by the FoAM to minimize the total computation. The 
FoAM is explained in greater detail in [14, §3].
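
The interplay between the BM and the FoAM described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the algorithm of [14]: the `refine(h, level)` interface standing in for the BM, the `max_level` cap, the toy BM, and the greedy selection rule (refine the active hypothesis with the largest upper bound) are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Bounds = Tuple[float, float]  # (lower bound, upper bound) on the evidence L(H)


def hypothesize_and_bound(
    hypotheses: Sequence[Hashable],
    refine: Callable[[Hashable, int], Bounds],
    max_level: int,
) -> Tuple[List[Hashable], Dict[Hashable, Bounds]]:
    """Return the hypotheses that cannot be beaten, plus the final bounds.

    `refine(h, level)` plays the role of the BM (hypothetical interface):
    it must return valid bounds that get tighter as `level` grows.
    """
    level = {h: 0 for h in hypotheses}
    bounds = {h: refine(h, 0) for h in hypotheses}  # rough, cheap bounds
    active = list(hypotheses)
    while True:
        best_lower = max(bounds[h][0] for h in active)
        # A hypothesis is discarded once its upper bound falls below the
        # lower bound of some other hypothesis.
        active = [h for h in active if bounds[h][1] >= best_lower]
        if len(active) == 1:
            return active, bounds  # a single hypothesis was proved optimal
        refinable = [h for h in active if level[h] < max_level]
        if not refinable:
            return active, bounds  # indistinguishable hypotheses
        # Illustrative selection rule (an assumption, not the FoAM of [14]).
        chosen = max(refinable, key=lambda h: bounds[h][1])
        level[chosen] += 1
        bounds[chosen] = refine(chosen, level[chosen])


def make_toy_bm(evidence: Dict[Hashable, float], width: float = 4.0):
    """Toy BM: an interval around the true evidence that halves per level."""
    def refine(h: Hashable, lvl: int) -> Bounds:
        w = width / (2 ** lvl)
        return evidence[h] - w, evidence[h] + w
    return refine
```

With the toy BM, the loop returns a single winner when the evidences differ and a group of indistinguishable hypotheses when they coincide, without ever evaluating any evidence exactly.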

H&B algorithms are general optimization procedures 
that can be applied to many different problems. To do 
so, however, a different evidence and a different BM have
to be developed for each particular problem (the same
FoAM, on the other hand, can be used in every problem).
To develop a BM for the current problem, in §3.4
we extend the theory of shapes presented in [14, §5]. 
Understanding this theory will be essential to follow 
the derivations in later sections. 



1.6 Organization of this article 

The remainder of this paper is organized as follows. In
§2 we place the current work in the context of prior
relevant work. In §3 we formally define the problem to
be solved, defining the evidence L(H) for it. In the
sections that follow we derive formulas to compute lower
and upper bounds for L(H), and describe how to implement
the BM using these formulas. After that we summarize
the proposed framework, present experimental results
obtained with it, and conclude with a discussion of key
contributions and directions for future research. Additional
details, such as proofs of the theoretical results stated in the
paper, are included in the supplementary material.



2 Prior work 

As mentioned in §1, in this work we focus on the problem
of simultaneous 3D reconstruction, pose estimation
and object classification, from a single foreground probability
image. Since in the absence of other constraints
this problem is ill posed, approaches to solve it must 
rely on some form of prior knowledge about the possible
classes of the object to be reconstructed. These approaches,
in general, differ in the representation used
for the reconstruction, the encoding scheme used for 
the prior knowledge, and the procedure to obtain the 
solution from the input image and the prior knowledge. 

Savarese and Fei-Fei [16], for example, proposed an
approach to simultaneously classify an object, estimate
its pose, and obtain a crude 3D reconstruction from
a single image. The 3D reconstruction consists of a
few planar faces (or "parts") linked together by homographic
transformations. The object class's prior knowledge
is encoded in this case by the appearance descriptor
of the parts and by the homographic transformations
linking them. Saxena et al. [17] and Hoiem et al.
[6], on the other hand, focus on the related problem of 
scene reconstruction from a single image. In these works 
a planar patch in the reconstructed surface is defined for 
each superpixel in the input image. The 3D orientation 
of these patches is inferred using a learned probabilis- 
tic graphical model that relates these orientations to 
features of the corresponding superpixels. Prior knowl- 
edge in this case is encoded in the learned relationship 
between superpixel features and patch 3D orientations. 
In contrast with our approach, these approaches rely 
on the appearance of the object (or scene), which as 
previously mentioned, can be highly variable for some 
object classes. 

The use of 3D shape information (or "geometry"), 
on the other hand, has a long tradition in computer vision
[10]. Since the early days many methods have been
proposed for the reconstruction and pose estimation of 
"well defined" object classes from a single image. How- 
ever, requiring object classes to be "well defined" often 
resulted in methods that dealt with somewhat artificial 
object classes, not frequently found in the real world 
(e.g., polyhedral shapes [11] and generalized cylinders
P). In general, these methods proceed by extracting 
geometric features (e.g., corners and edges) from an
image, grouping these features to form hypotheses, and 
then validating these hypotheses using geometric con- 
straints. One problem with these methods is that it is 
difficult to extend them to handle classes of real objects 
which might be very complex and might not contain 
geometric features at all. A second problem with these 
methods is that they could be very sensitive to erro- 
neously detected features, due to their lack of reliance 
on statistical formulations. 

More recently a number of other methods for 3D 
reconstruction from a single image have been proposed
for specific object classes {e.g. piTS)). In general these 
methods consist of a parametric model of the object 
class to be represented and a procedure to find the 
best fit between the projection of the model and the 
input image. Prior knowledge in this case is encoded 
in the design of the model (e.g., which parts an articulated
model has, and how they are connected). Object
classes that have been modeled in this way include
trees/grasses [5] and people [20]. Model-based approaches
are best suited to reconstruct objects of the
particular class they were designed for and are difficult 
to extend beyond this class, since the model is typically 
designed manually for that particular class. 

In contrast, more general representations that can 
learn about a class of objects from exemplars (as our 
approach does), can be trained on new classes without 
having to redesign the representation anew each time. 
One example of such a general representation can be 
found in the work of Sandhu et al. [15], which uses a
level set formulation coupled with shape priors to seg- 
ment an object in a single image and estimate its pose. 
The prior shape knowledge is learned for an object class 
from a set of training exemplars of the class. To con- 
struct the shape prior Sandhu et al. compute the signed 
distance function (SDF) for each 3D shape in the train- 
ing set, and then learn the principal components of this 
set of SDFs. 

While we consider this work to be the most similar 
to ours regarding its goals, the two approaches have two 
major differences. First, unlike our approach, that work
is not guaranteed to find the global optimum, but
only a local optimum that critically depends on the initial
condition (this is further discussed in §8.2). Second,
Sandhu et al. do not address the tasks of classification
or 3D reconstruction. While it could appear that that 
work could be modified to handle these tasks, we argue 
that these modifications are not trivial. For example, 
a 3D reconstruction could be computed from the lin- 
ear combination of SDFs estimated by that framework. 
This is not trivial, however, because a linear combination
of SDFs is not itself an SDF. Similarly the class
could be estimated by considering a mixture model, or 
simply running the framework multiple times with dif- 
ferent priors and keeping the best solution. However, 
since the framework has no optimality guarantees, this 
would make the method even more prone to getting stuck
in a local optimum. 

Hence, to the best of our knowledge there are no 
other works focusing on exactly the same problem, ex- 
cept our own work in [13] of which the current work is 
a formalization and extension. There are two major differences
between these two works. The first difference is
that the segmentation and the 3D reconstruction considered
in the current work are continuous, while in [13]
they are discrete. This allows us to compute in this work
tighter bounds for the evidence L(H), based on the theory
of shapes described in [14]. The second difference
is that the current model corrects a bias discovered in
the model in [13]. Given two similar 3D shapes with
equal projection on the camera plane, this bias made
the framework in [13] select the shape that was further
away from the camera. The current model, on the
other hand, has no preference for either shape. This is
explained in detail in §3.4.

Fig. 1 Setup for the problem at hand. The camera is defined
by the camera center c and a patch $\Theta$ in the unit sphere
($\Theta \subset \mathbb{S}^2$). The world set $\Phi$ is defined as those 3D points that project
to $\Theta$ and whose distance to the camera center is in the interval
$[R_{min}, R_{max}]$. Any 3D point $X \in \Phi$ projects to a single point
x in the camera retina. A single object (represented by the
blue cylinder) is assumed to be in $\Phi$.



3 Problem formulation 

In this section we formally define the problem of joint 
classification, pose estimation and 3D reconstruction 
from a single 2D image, which we alluded to in previous
sections. This problem is defined as follows. Let
$f : \Theta \to \mathbb{R}^c$ ($c \in \mathbb{N}$) be a 2D image of c-dimensional
"features" produced as the noisy 2D projection of a
single 3D object (Fig. 1). This object is assumed to belong
to a class from a set of known classes. Given this
input image f and the 3D shape priors (defined later),
our problem is to estimate the class K of the object,
its pose T, recover a 2D segmentation q of the object
in the image, and estimate a 3D reconstruction v of
the object in 3D space. The relationships among these
variables are depicted in the factor graph of Fig. 2.

In order to estimate v and the hypothesis H =
(K, T) given the observations f, we use the maximum
a posteriori estimator [8, Chapter 11]. Thus we find
the mode of the posterior distribution, which is given
by the product of the likelihood function, $P(f|q,v,H)$,
and the prior distribution, $P(q,v,H)$. From the independence
assumptions depicted in Fig. 2, the posterior
distribution is equal, up to a constant, to

\[ P(f|q,v,H)\,P(q,v,H) = P(f|q)\,P(q|v)\,P(v|H)\,P(H). \tag{2} \]

Our goal can now be stated as finding the values of K,
T, q, and v that maximize (2). In doing so we would
be estimating K, T, q and v simultaneously.

Fig. 2 Factor graph proposed to solve the problem. A factor
graph [2] has a variable node (circle) for each variable, and a
factor node (square) for each factor in the system's joint
probability. Factor nodes are connected to the variable nodes
of the variables in the factor. Observed variables are shaded.
A plate indicates that there is an instance of the nodes in the
plate for each element in a set (indicated on the lower right).
The plates in this graph hide the existing loops.

Before we formally define each one of the terms
on the rhs of (2), in §3.1 we briefly review the theory
of shapes introduced in [14]. In §3.2, §3.3 and §3.4
we use this theory to formally define $P(f|q)$, $P(v|H)$
and $P(q|v)$, respectively. We conclude the section by
putting these terms together to obtain an expression for
the evidence L(H) of a hypothesis H, which is closely
related to the posterior distribution in (2).



3.1 Review: continuous and discrete shapes and their 
likelihoods 

In the previous section we mentioned the 2D segmen- 
tation q and the 3D reconstruction v without explicitly 
defining a specific representation for them. These entities
are two instances of what we call continuous shapes,
as defined next. 

Definition 1 (Continuous shape) Given a set $\Omega \subset \mathbb{R}^d$,
a set $s \subset \Omega$ is a continuous shape if: 1) it is open,
and 2) its boundary has zero measure. Alternatively, a
continuous shape s can also be regarded as the function
$s : \Omega \to \{0, 1\}$ defined by

\[ s(x) \triangleq \begin{cases} 1, & \text{if } x \in s, \\ 0, & \text{otherwise.} \end{cases} \tag{3} \]

In order to define the terms $P(f|q)$ and $P(v|H)$ involving
the continuous shapes q and v in (2), we define
the likelihood of a continuous shape s by extending the
definition of the likelihood of a discrete shape $\hat{s}$, defined
next.

Definition 2 (Discrete shape) Given a partition
$\Pi(\Omega) = \{\Omega_1, \dots, \Omega_n\}$ of a set $\Omega \subset \mathbb{R}^d$ (i.e., a collection
of sets such that $\bigcup_i \Omega_i = \Omega$, and $\Omega_i \cap \Omega_j = \emptyset$ for
$i \neq j$), the discrete shape $\hat{s}$ is defined as the function
$\hat{s} : \Pi(\Omega) \to \{0, 1\}$.

Notice that a continuous shape s can be produced from
a discrete shape $\hat{s}$ (denoted as $\hat{s} \sim s$) as $s(x) = \hat{s}(\Omega_i)$,
for all $x \in \Omega_i$ and $i = 1, \dots, n$. Discrete shapes will
be used later to approximate continuous shapes and to
compute lower bounds for their likelihoods.

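Definition 3 (Bernoulli field) A discrete Bernoulli
Field (BF) is a family of independent Bernoulli random
variables $\hat{B} = \{\hat{B}_1, \dots, \hat{B}_n\}$ characterized by the success
rates $p_{\hat{B}}(i) = P(\hat{B}_i = 1)$. The log-likelihood of the
discrete shape $\hat{s}$ according to the discrete BF $\hat{B}$ is then
computed as

\[ \log P(\hat{B} = \hat{s}) \triangleq \sum_{i=1}^{n} \log P\big(\hat{B}_i = \hat{s}(\Omega_i)\big). \tag{4} \]

For a discrete BF $\hat{B}$, we define the constant term $Z_{\hat{B}}$
and the logit function $\delta_{\hat{B}}(i)$ to be

\[ Z_{\hat{B}} = \sum_{i=1}^{n} \log\big(1 - p_{\hat{B}}(i)\big), \quad \text{and} \tag{5} \]
\[ \delta_{\hat{B}}(i) = \log\left(\frac{p_{\hat{B}}(i)}{1 - p_{\hat{B}}(i)}\right), \tag{6} \]

respectively. Then it can be shown that (4) can be
rewritten as

\[ \log P(\hat{B} = \hat{s}) = Z_{\hat{B}} + \sum_{i=1}^{n} \hat{s}(\Omega_i)\,\delta_{\hat{B}}(i). \tag{7} \]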
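
The equivalence between the direct log-likelihood (4) and its "constant plus logits" form (7) can be checked numerically. The following is a minimal sketch: the function names and the example values are hypothetical, and the success rates are assumed to lie strictly between 0 and 1 so that all logarithms are finite.

```python
import math
from typing import Sequence


def log_likelihood_direct(p: Sequence[float], s: Sequence[int]) -> float:
    """Eq. (4): sum over the partition elements of log P(B_i = s_i)."""
    return sum(math.log(pi if si == 1 else 1.0 - pi) for pi, si in zip(p, s))


def log_likelihood_logit(p: Sequence[float], s: Sequence[int]) -> float:
    """Eq. (7): constant term plus the logits of the elements where s = 1."""
    z = sum(math.log(1.0 - pi) for pi in p)            # Eq. (5)
    delta = [math.log(pi / (1.0 - pi)) for pi in p]    # Eq. (6)
    return z + sum(si * di for si, di in zip(s, delta))


# Hypothetical success rates and discrete shape on a 4-element partition.
rates = [0.9, 0.2, 0.5, 0.7]
shape = [1, 0, 1, 1]
```

The form (7) is convenient because the constant term does not depend on the shape, so comparing two shapes only requires the sum of logits over the elements where they differ.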



In order to compute the likelihood of a continuous
shape, we first define continuous BFs by analogy with
discrete BFs. Similarly to those, a continuous BF is also
a collection of independent Bernoulli random variables.
In a continuous BF B, however, one variable B(x) is
defined for each point $x \in \Omega$, rather than for each index
$i \in \{1, \dots, n\}$. The success rates for the variables in the
BF are given by the function $p_B(x) = P(B(x) = 1)$. A
continuous BF B can be constructed from a discrete
BF $\hat{B}$ by defining its success rates to be $p_B(x) = p_{\hat{B}}(i)$,
for all $x \in \Omega_i$ and for $i = 1, \dots, n$. This continuous
BF is said to be produced from the discrete BF and is
denoted as $\hat{B} \sim B$. By analogy with (4) we define the
"log-likelihood" of a continuous shape s according to a
continuous BF B as

\[ \log P(B = s) = \frac{1}{u_\Omega} \int_\Omega \log P\big(B(x) = s(x)\big)\, dx, \tag{8} \]

where $u_\Omega$ is a constant to be described later, referred
to as the equivalent unit size.



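If we now define the constant term $Z_B$ and the logit
function $\delta_B$ for the BF B, respectively as

\[ Z_B = \int_\Omega \log\big(1 - p_B(x)\big)\, dx, \quad \text{and} \tag{9} \]
\[ \delta_B(x) = \log\left(\frac{p_B(x)}{1 - p_B(x)}\right), \tag{10} \]

the likelihood in (8) can be rewritten as

\[ \log P(B = s) = \frac{1}{u_\Omega} \left[ Z_B + \int_\Omega s(x)\,\delta_B(x)\, dx \right]. \tag{11} \]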

The following proposition shows that, under certain
conditions, the continuous and discrete "log-likelihoods"
in (11) and (7), respectively, coincide. For this reason
we said that (11) extends (7).

Proposition 1 (Relationship between likelihoods
of continuous and discrete shapes) Let $\Pi(\Omega)$ be a
partition of a set $\Omega$ such that $|\omega| = u_\Omega$ $\forall \omega \in \Pi(\Omega)$,
and let $\hat{B}$ and $\hat{s}$ be a discrete BF and a discrete shape,
respectively, defined on $\Pi(\Omega)$. Finally, let B be a continuous
BF and let s be a continuous shape such that
$\hat{B} \sim B$ and $\hat{s} \sim s$. Then, the log-likelihoods of the continuous
and discrete shapes are equal, i.e.,

\[ \log P(B = s) = \log P(\hat{B} = \hat{s}). \tag{12} \]

Proof: Immediate from the definitions.
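
Proposition 1 can be illustrated numerically on a one-dimensional domain. The sketch below is an assumption-laden toy, not part of the original derivation: the continuous BF and shape are taken to be piecewise constant on the cells of the partition, the integral in (8) is approximated with a midpoint rule, and all function names and values are hypothetical.

```python
import math
from typing import Sequence


def discrete_log_likelihood(p_hat: Sequence[float], s_hat: Sequence[int]) -> float:
    """Eq. (4) for a discrete BF and a discrete shape."""
    return sum(math.log(p if s == 1 else 1.0 - p) for p, s in zip(p_hat, s_hat))


def continuous_log_likelihood(
    p_hat: Sequence[float],
    s_hat: Sequence[int],
    cell_sizes: Sequence[float],
    unit_size: float,
    samples_per_cell: int = 100,
) -> float:
    """Midpoint-rule value of Eq. (8) for the BF and shape produced from
    (p_hat, s_hat), i.e. constant inside each cell of the partition."""
    total = 0.0
    for p, s, size in zip(p_hat, s_hat, cell_sizes):
        dx = size / samples_per_cell
        for _ in range(samples_per_cell):
            # The integrand is the same at every sample inside a cell.
            total += math.log(p if s == 1 else 1.0 - p) * dx
    return total / unit_size


# Hypothetical discrete BF and shape on a 3-cell partition.
p_hat = [0.9, 0.2, 0.6]
s_hat = [1, 0, 1]
```

When every cell has size equal to the equivalent unit size, the two values agree, as the proposition states; with cells of unequal size they generally differ, which is why the condition $|\omega| = u_\Omega$ is needed.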



Note that the equivalent unit size $u_\Omega$ "scales" the
value in brackets in (11) according to the resolution of
the partition $\Pi(\Omega)$, making it comparable to (7). Now
we are ready to define the terms on the rhs of (2). In
the following, when no ambiguity is possible, we will
abuse notation and write P(s) or P(s(x) = 1) instead
of P(B = s) and P(B(x) = 1), respectively.



3.2 Image term: log P(f|q)

The 2D segmentation q that we want to estimate (level
5 of Fig. 2) is represented as a 2D continuous shape
defined on the image domain $\Theta$. This segmentation q
states whether each point $x \in \Theta$ is deemed by the
framework to be in the Background (if q(x) = 0) or
in the Foreground (if q(x) = 1).

We assume that the state q(x) of a point $x \in \Theta$ cannot
be observed directly, but rather that it defines the
pdf of a feature f(x) at x that is observed. For example,
f(x) could simply indicate color, depth, the output of
a classifier or in general any feature directly observed
at the point x, or computed from other features observed
in the neighborhood of x. Moreover, we suppose
that if a point x belongs to the Background, its feature
f(x) is distributed according to the pdf $p_x(f(x)|q(x) = 0)$,
while if it belongs to the Foreground, f(x) is distributed
according to $p_x(f(x)|q(x) = 1)$. This feature
f(x) is assumed to be independent of the feature f(y)
and the state q(y) at every other point $y \in \Theta$, given
q(x).

The subscript x in $p_x$ was added to emphasize the
fact that a different pdf might be used for every point.
In other words, it could be that $p_x(f_0|q_0) \neq p_y(f_0|q_0)$ if
$x \neq y$, where $f_0$ and $q_0$ are two arbitrary values of f and
q, respectively. For example, in our experiments
a different pdf $p_x(f(x)|q(x) = 0)$ was learned for every
point x in the background. In this case multiple Gaussian
pdf's were used (one per point). On the other hand,
a single pdf, $p(f(x)|q(x) = 1)$, a mixture of Gaussians,
was used for all the points in the foreground.

Then from (8) the conditional "log-density" of the
observed features f, given the 2D shape q, that we refer
to as the image term, is given by

\[ \log P(f|q) \triangleq \frac{1}{u_\Theta} \int_\Theta \log p_x\big(f(x)|q(x)\big)\, dx, \tag{13} \]

where the equivalent unit size in $\Theta$, $u_\Theta$, is a constant to
be fixed. Defining the continuous BF $B_f$ with success
rates

\[ p_{B_f}(x) = \frac{p_x\big(f(x)|q(x) = 1\big)}{p_x\big(f(x)|q(x) = 0\big) + p_x\big(f(x)|q(x) = 1\big)}, \tag{14} \]

it follows that (13) is equal, up to a constant, to the
"log-likelihood" of the shape q according to the BF $B_f$,
i.e., $\log P(f|q) = \log P(B_f = q) + C_1$. Therefore, using
(11), the image term can be written as

\[ \log P(f|q) = \frac{1}{u_\Theta} \left[ Z_{B_f} + \int_\Theta q(x)\,\delta_{B_f}(x)\, dx \right] + C_1. \tag{15} \]
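
The passage from (13) to (15) can be checked on a small pixel grid. The sketch below makes several assumptions that are not in the text: the integrals are replaced by sums over pixels, the features are scalar, a single Gaussian stands in for the foreground mixture, and all names and numeric values are hypothetical.

```python
import numpy as np


def gaussian_pdf(f, mean, std):
    """Density of a 1D Gaussian, evaluated elementwise."""
    return np.exp(-0.5 * ((f - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def success_rates(p_fg, p_bg):
    """Eq. (14): per-pixel success rates of the BF B_f."""
    return p_fg / (p_bg + p_fg)


def log_density_direct(q, p_fg, p_bg, pixel_area=1.0, unit_size=1.0):
    """Eq. (13), with the integral replaced by a sum over pixels."""
    per_pixel = np.where(q == 1, np.log(p_fg), np.log(p_bg))
    return per_pixel.sum() * pixel_area / unit_size


def image_term(q, p_bf, pixel_area=1.0, unit_size=1.0):
    """Eq. (15) without the constant C_1."""
    z = np.log1p(-p_bf).sum() * pixel_area      # constant term Z_{B_f}
    delta = np.log(p_bf) - np.log1p(-p_bf)      # logit function of B_f
    return (z + (q * delta).sum() * pixel_area) / unit_size


# Hypothetical 2x2 feature image, per-pixel background means, one foreground pdf.
f = np.array([[0.1, 0.9], [0.8, 0.2]])
bg_mean = np.array([[0.0, 0.1], [0.2, 0.1]])
p_bg = gaussian_pdf(f, bg_mean, 0.3)
p_fg = gaussian_pdf(f, 1.0, 0.3)
p_bf = success_rates(p_fg, p_bg)
```

In this discretized setting the constant $C_1$ is the sum over pixels of $\log(p_{bg} + p_{fg})$ (scaled by the pixel area over the unit size), which does not depend on q, so ranking candidate segmentations only requires the bracketed term in (15).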



3.3 3D shape prior term: log P(v|H)

While the segmentation q is a 2D continuous shape on
the 2D image domain $\Theta$, the reconstruction v (that we
also want to estimate) is a 3D continuous shape on the
set $\Phi \subset \mathbb{R}^3$. This 3D reconstruction v (level 3 of Fig.
2) states whether each point $X \in \Phi$ is deemed by the
framework to be In the reconstruction (if v(X) = 1),
or Out of it (if v(X) = 0). In this reconstruction the
coordinates of each 3D point X are expressed in the
world coordinate system (WCS) defined on the set $\Phi$.

As mentioned before our problem of interest is ill 
posed unless some form of prior knowledge about the 
shape of the objects is incorporated. We assume that 
the object class K (level 1 of Fig. 2) is one out of $N_K$
distinct possible object classes, each one characterized
by a 3D shape prior $B_K$ encoding our prior geometric
knowledge about the object class. This knowledge is
stated with respect to an intrinsic 3D coordinate system
(ICS) defined for each class. In other words, all the
objects of the class are assumed to be in a canonical
(normalized) pose in this ICS. Each shape prior $B_K$ is
encoded as a BF (also referred to as $B_K$), such that for
each point X' in the ICS of the class, the success rate
$p_{B_K}(X') = P(v'(X') = 1|K)$ indicates the probability
that the point X' would be In the 3D reconstruction v'
defined in the ICS, given the class K of the object. We
assume that $p_{B_K}$ is zero everywhere, except (possibly)
on a region $\Phi_K \subset \mathbb{R}^3$ called the support of the class K.

Note that the shape prior $B_K$ and the 3D continuous shape $v'$ alluded
to in the previous paragraph are defined in the ICS of the class. To
obtain the corresponding entities in the WCS, we define the
transformation $T : \mathbb{R}^3 \to \mathbb{R}^3$ that maps a point $X'$
in the ICS to the point $X = T(X')$ in the WCS. This transformation is
referred to as the pose and is another unknown to be estimated (level 1
of Fig. 2). The transformation $T$ relates the desired 3D reconstruction
$v$ and the hypothesis BF $B_H = B_{K,T}$ defined in the WCS, to the
reconstruction $v'$ and the class BF $B_K$ defined in the ICS.
Specifically, the 3D reconstruction $v(X)$ and the success rates of
$B_H$, $p_{B_H}(X) = P(v(X) = 1 | H)$ (level 2 of Fig. 2), are given by

$$v(X) = v'(T^{-1}(X)), \tag{16}$$
$$p_{B_H}(X) = p_{B_K}(T^{-1}(X)). \tag{17}$$

The BF $B_H$ thus encodes the probability that a point $X$ is In the
reconstruction $v$. The support of the hypothesis $H$ is given by
$\Phi_H = T(\Phi_K)$.
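As a minimal sketch of (17), the following Python code evaluates the
success rate of a hypothesis BF by pulling each WCS point back to the ICS.
The similarity pose (scale, rotation about the z axis, translation), the
toy ball-shaped class prior, and all function names are assumptions made
for illustration only; the text above does not restrict $T$ to this family.

```python
import math

def make_similarity(scale, yaw, translation):
    """Return (T, T_inv) for X = scale * Rz(yaw) * X' + translation.
    Hypothetical pose family, chosen only for illustration."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation

    def T(Xp):
        x, y, z = Xp
        return (scale * (c * x - s * y) + tx,
                scale * (s * x + c * y) + ty,
                scale * z + tz)

    def T_inv(X):
        # Undo the translation and the scale, then apply the transposed rotation.
        x, y, z = ((X[0] - tx) / scale, (X[1] - ty) / scale, (X[2] - tz) / scale)
        return (c * x + s * y, -s * x + c * y, z)

    return T, T_inv

def class_prior(Xp):
    """Toy class BF p_{B_K}: success rate 0.9 inside the unit ball (assumption)."""
    return 0.9 if sum(v * v for v in Xp) <= 1.0 else 0.0

def hypothesis_prior(X, T_inv, p_class=class_prior):
    """Eq. (17): p_{B_H}(X) = p_{B_K}(T^{-1}(X))."""
    return p_class(T_inv(X))
```

A point is therefore inside the support of the hypothesis exactly when its
pre-image under $T$ is inside the support of the class.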



Therefore, from (11), the conditional "log-probability" of the
continuous shape $v$ according to the BF $B_H$ is given by

$$\log P(v|H) \triangleq \frac{1}{u_\Phi(T)} \int_\Phi \log P(B_H(X) = v(X))\, dX = \frac{1}{u_\Phi(T)} \left[ Z_{B_H} + \int_\Phi v(X)\, \delta_{B_H}(X)\, dX \right], \tag{18}$$

where the equivalent unit size in $\Phi$, $u_\Phi(T)$, depends on the
transformation $T$ and is defined below. We refer to (18) as the shape
prior term.

In order to derive an expression for the equivalent unit size
$u_\Phi(T)$, we want to enforce that the "log-probabilities" of two
hypotheses corresponding to objects that are similar (i.e., that are
related by a change of scale) and have the same projection on the camera
retina are equal. Enforcing this prevents the system from having a bias
towards either smaller objects closer to the camera, or bigger objects
farther away from the camera. In Proposition ?? (in the supplementary
material) we show that to accomplish this, the unit size $u_\Phi(T)$
must be of the form $u_\Phi(T) = |J(T)| / \lambda'$, where $|J(T)|$ is
the Jacobian determinant of the transformation $T$ and $\lambda' > 0$ is
an arbitrary constant.
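The effect of this choice can be checked numerically in a simplified
setting. The sketch below assumes a pure scaling $T$ (so that
$|J(T)|$ equals the cube of the scale), a ball-shaped reconstruction and
a constant logit inside it; it only evaluates the integral part of (18)
and leaves the $Z_{B_H}$ term aside. All names and numbers are
illustrative assumptions.

```python
import math

def unit_size(jacobian_det, lam_prime):
    """u_Phi(T) = |J(T)| / lambda'."""
    return abs(jacobian_det) / lam_prime

def scaled_ball_prior_term(scale, delta=1.5, radius=1.0, lam_prime=2.0):
    """(1 / u_Phi(T)) * integral of v * delta over a ball of radius scale * radius,
    for a constant logit delta and a pure scaling T with |J(T)| = scale**3."""
    volume = 4.0 / 3.0 * math.pi * (scale * radius) ** 3
    return delta * volume / unit_size(scale ** 3, lam_prime)
```

Under these assumptions the volume of the scaled ball and the unit size
both grow with the cube of the scale, so the value returned is the same
for every scale.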



3.4 Projection term: log P(q|v)

The segmentation $q$ and the reconstruction $v$ defined in previous
sections are certainly not independent, as we expect $q$ to be (at
least) "close" to the projection of $v$ on the camera retina. The
projection term, $\log P(q|v)$, encodes the "log-probability" of
obtaining a segmentation $q$ in the camera retina $\Theta$, given that a
reconstruction $v$ is present in the space $\Phi$ in front of the
camera. In order to define this term more formally, we first need to
understand the relationship between the sets $\Theta$ and $\Phi$ encoded
by the camera transformation.

The camera transformation maps points from the 3D space $\Phi$ into the
2D camera retina $\Theta$ (Fig. 1). For simplicity, we consider a
spherical retina rather than a planar retina. In other words, the image
domain $\Theta$ is a subset of the unit sphere ($\Theta \subset S^2$).
Given a point $c \in \mathbb{R}^3$, referred to as the camera center, a
correspondence is established between points in $\Phi$ and $\Theta$: for
each point $x \in \Theta$, the points in the set
$R(x) \triangleq \{X \in \mathbb{R}^3 : X = c + r x,\ r \in [0, \infty)\}$
are said to project to $x$. This set is referred to as the ray of $x$.
Considering this correspondence, we will often refer to points
$X \in \Phi$ by their projection $x$ in $\Theta$ and their distance $r$
from the camera center, as in $X = (x, r)$. The domain $\Phi$ is thus
formally defined as the set of points in 3D world space that are visible
in the input image and are at a certain distance range from the camera
center, i.e.,
$\Phi \triangleq \{(x, r) \in \mathbb{R}^3 : x \in \Theta,\ R_{min} \le r \le R_{max}\}$
(Fig. 1).
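The correspondence $X = (x, r)$ can be sketched as a pair of conversion
routines. The function names are hypothetical, and membership of $x$ in
the retina $\Theta$ is left out because it depends on the particular
image domain.

```python
import math

def to_ray_coords(X, c):
    """Return (x, r): the unit direction x on the sphere and the distance r
    from the camera center c, so that X = c + r * x."""
    d = [Xi - ci for Xi, ci in zip(X, c)]
    r = math.sqrt(sum(di * di for di in d))
    if r == 0.0:
        raise ValueError("the camera center has no well-defined direction")
    return tuple(di / r for di in d), r

def from_ray_coords(x, r, c):
    """Return the 3D point X = c + r * x."""
    return tuple(ci + r * xi for xi, ci in zip(x, c))

def in_distance_range(r, r_min, r_max):
    """Distance part of the definition of Phi: R_min <= r <= R_max."""
    return r_min <= r <= r_max
```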

As mentioned at the beginning of this section, the shape $q$ is a
"projection" of the continuous 3D shape $v$ in $\Phi$ onto $\Theta$. In
other words, the state $q(x)$ (Foreground or Background) of the shape
$q$ at a point $x \in \Theta$ only depends on the states of the shape
$v$ (In or Out) in the ray $R(x)$, and not on the states of $v$ in other
points of $\Phi$. To emphasize this fact we will write
$P(q(x)|v) = P(q(x)|v_{R(x)})$, where $v_{R(x)}$ denotes the part of $v$
in $R(x)$ and is referred to as the shape $v$ in the ray $R(x)$. Note
that $v_{R(x)} : R(x) \to \{0, 1\}$ is itself a 1D continuous shape
defined by $v_{R(x)}(r) \triangleq v(c + r x)$.

Then, given a 3D continuous shape $v$ in $\Phi$, our goal is to define a
BF $B_q$ for $q$ in $\Theta$ (level 4 of Fig. 2) by "projecting" the 3D
shape $v$ into $\Theta$. For notational convenience, given a point
$x \in \Theta$, its ray $R(x)$ and the shape $v$ in this ray,
$v_{R(x)}$, let us define the failure rate of the BF $B_q$ as

$$g(v_{R(x)}) \triangleq P(q(x) = 0 | v_{R(x)}). \tag{19}$$

This function, referred to as a projection function, encodes the
probability of "seeing" the Background (i.e., not the shape $v$) at $x$
given the shape $v_{R(x)}$ in $R(x)$.

We could simply define a projection function to be 1 when the measure of
the set $v_{R(x)}$ is strictly zero (recall that continuous shapes can
also be considered as sets; in this case $v_{R(x)}$ is the set
$\{X \in R(x) : v(X) = 1\}$), and to be 0 otherwise, yielding the
natural projection function

$$g_{Natural}(v_{R(x)}) \triangleq \begin{cases} 1, & \text{if } |v_{R(x)}| = 0, \\ 0, & \text{otherwise.} \end{cases} \tag{20}$$



However, this projection function leads to solutions that are not
desirable, since these solutions can "explain" the parts of the image
that are much more likely Foreground than Background (i.e., those where
$p_x(f(x)|q(x) = 1) \gg p_x(f(x)|q(x) = 0)$) by placing an infinitesimal
amount of "mass" in the parts of 3D space that, according to the shape
prior, are much more likely to be Out of the reconstruction than In it
(i.e., where $P(v(X) = 0|H) \gg P(v(X) = 1|H)$). Note that because the
amount of mass placed is infinitesimal, these solutions do not "pay the
price" of living in the unlikely parts of 3D space, but still "collect
the rewards" of living in the likely parts of the 2D image. In order to
avoid these undesirable solutions, we derive next, from first
principles, a new expression for projection functions.

Intuitively we expect $g(v_{R(x)})$ to be lower when the measure of the
set $v_{R(x)}$ is larger (i.e., the larger the part of the object that
intersects the ray, the less likely it is to see the background through
that ray). In other terms, given two shapes in the same ray $R$, $v_R^1$
and $v_R^2$, such that
$v_R^1(r) \le v_R^2(r)\ \forall r \in [R_{min}, R_{max}]$, we expect

$$g(v_R^1) \ge g(v_R^2). \tag{21}$$

This property of the projection function, referred to as monotonicity,
guarantees that reconstructions that are intuitively "worse" are
assigned lower log-probabilities.

While there are many constraints that can be imposed to enforce
monotonicity, we will require that projection functions satisfy the
independence property defined next. In addition to monotonicity, this
will yield a simple form for the projection function that has an
intuitive interpretation.

Definition 4 (Independence) Given two continuous shapes in the same ray
$R$, $v_R^1$ and $v_R^2$, that are disjoint (i.e.,
$v_R^1 \cap v_R^2 = \emptyset$), a projection function $g$ is said to
have the independence property if

$$g(v_R^1 \cup v_R^2) = g(v_R^1)\, g(v_R^2). \tag{22}$$

In words, (22) states that the events that the background is occluded by
one shape or the other are independent. It can be seen that (22) implies
(21).

Next consider two 3D continuous shapes $v^1$ and $v^2$ related by a
central dilation $T_c$ of scale $\sigma > 1$, whose center is at the
camera center $c$ (i.e., $v^1(X) = v^2(T_c(X))$ and
$T_c(X) = c + \sigma(X - c)$). These two shapes are similar (i.e., they
are equal up to a change of scale) and produce the same projection on
the camera retina, even though $v^1$ is smaller and closer to the camera
than $v^2$. In this situation we would like the framework to be agnostic
as to which shape is present in the scene. Otherwise the framework would
be either biased towards smaller shapes that are closer to the camera,
or towards larger shapes that are farther away from it. This behavior is
enforced by requiring projection functions to be scale invariant, as
defined next.

Definition 5 (Scale invariance) Given any shape in a ray, $v_R$, a
projection function $g$ is said to be scale invariant if

$$g(v_R(r)) = g(v_R(\sigma r)), \quad \forall \sigma > 0. \tag{23}$$

The following proposition provides a family of functions that satisfy
the desired requirements (22) and (23).



Proposition 2 (Form of the projection function) Let us denote by
$\mathbb{1}_{(u,w)}$ a shape in a ray that consists of the interval
$(u, w)$, and let $g$ be an independent and scale invariant projection
function that satisfies the condition

$$g(\mathbb{1}_{(1,e)}) = e^{\alpha}, \quad \text{for some } \alpha < 0. \tag{24}$$

Let $v$ be a 3D continuous shape, and let

$$\ell_v(x) \triangleq \int_0^\infty \frac{v(c + r x)}{r}\, dr = \int_0^\infty \frac{v_{R(x)}(r)}{r}\, dr \tag{25}$$

be a measure of the "mass" in the ray $R(x)$. Then the projection
function $g$ must have the form

$$g(v_{R(x)}) = e^{\alpha\, \ell_v(x)}. \tag{26}$$

Proof: See proof in the supplementary material. □



It is interesting to note in (26) that the scalar quantity $\ell_v(x)$
summarizes the relevant characteristics of the 1D shape $v_{R(x)}$. For
this reason we will abuse the notation and write
$P(q(x)|v_{R(x)}) = P(q(x)|\ell_v(x))$.
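The properties required of the projection function can be verified
numerically for shapes in a ray that are unions of disjoint intervals.
The sketch below is only an illustration: the interval representation
and the function names are assumptions, and the default value of
$\alpha$ is arbitrary.

```python
import math

def ray_mass(intervals):
    """l = integral of v_R(r) / r dr for a shape v_R given as a list of
    disjoint intervals (u, w) with 0 < u < w (assumed representation)."""
    return sum(math.log(w / u) for u, w in intervals)

def projection(intervals, alpha=-1.0):
    """g = exp(alpha * l) with alpha < 0: probability of seeing the Background."""
    return math.exp(alpha * ray_mass(intervals))
```

Because the mass of an interval $(u, w)$ is $\log(w/u)$, it depends only
on the ratio of its endpoints (scale invariance), and the masses of
disjoint intervals add up, so the corresponding values of $g$ multiply
(independence).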

Note that the natural projection function defined in (20) also satisfies
the independence and scale invariance requirements defined before.
Moreover, the natural projection function is an extreme of the family of
functions defined in Proposition 2 (for $\alpha \to -\infty$). For the
reasons previously described, however, and as we empirically observed,
this function is not convenient for our purposes.



Using Proposition 2 we can now write an expression for the projection
term as

$$\log P(q(x)|v) = \log P(q(x)|v_{R(x)}) = \log P(q(x)|\ell_v(x)) = \begin{cases} \alpha\, \ell_v(x), & \text{if } q(x) = 0, \\ \log\left(1 - e^{\alpha \ell_v(x)}\right), & \text{if } q(x) = 1. \end{cases} \tag{27}$$

Therefore, using (8), the projection term is given by

$$\log P(q|v) = \frac{1}{u_\Theta} \int_\Theta \log P(q(x)|\ell_v(x))\, dx, \tag{28}$$

where $u_\Theta$ is the equivalent unit size in $\Theta$ (the same that
appeared in (15) before).
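A pointwise evaluation of the two cases in (27) can be sketched as
follows. The function name and the default value of $\alpha$ are
assumptions; `math.expm1` is used only to keep the Foreground case
accurate when the mass in the ray is small.

```python
import math

def log_p_pixel(q, ell, alpha=-1.0):
    """log P(q | ell) for q in {0, 1}, ray mass ell >= 0 and alpha < 0."""
    if q == 0:
        return alpha * ell                  # Background: log of exp(alpha * ell)
    if ell == 0.0:
        return float("-inf")                # no mass in the ray: Foreground is impossible
    return math.log(-math.expm1(alpha * ell))  # Foreground: log(1 - exp(alpha * ell))
```

For any fixed mass the two cases exponentiate to probabilities that sum
to one.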



The definition of the projection function in (26) contrasts with the
choice made in [13], where shift invariance rather than scale invariance
was imposed. That choice biases the decision between two similar shapes
with equal projection on the camera retina towards the shape that is
farther away from the camera.



3.5 Definition of the evidence L(H)

In previous subsections we derived expressions for each of the terms in
the system's posterior distribution given in (2). In this section we put
them together to find an expression for the evidence $L(H)$ of a
hypothesis $H$. This is the expression that the system will optimize.

Substituting the expressions for the image term, $\log P(f|q)$, the
shape prior term, $\log P(v|H)$, and the projection term, $\log P(q|v)$
(given by (15), (18), and (28), respectively) into (2), the
log-posterior is given by

$$\log P(f|q,v,H) + \log P(q,v,H) = \log P(H) + \frac{1}{u_\Theta} \left[ Z_{B_f} + \int_\Theta q(x)\, \delta_{B_f}(x)\, dx \right] + \frac{1}{u_\Theta} \int_\Theta \log P(q(x)|\ell_v(x))\, dx + \frac{\lambda'}{|J(T)|} \left[ Z_{B_H} + \int_\Phi v(X)\, \delta_{B_H}(X)\, dX \right]. \tag{29}$$



Our goal can now be formally stated as solving
$\sup_{q,v,H} [\log P(f|q,v,H) + \log P(q,v,H)]$, which is equivalent to
solving $\max_H L'(H)$, with

$$L'(H) \triangleq \sup_{q,v} \left[ \log P(f|q,v,H) + \log P(q,v,H) \right] \tag{30}$$

(the supremum is used in optimizations over $q$ and $v$ since the set of
continuous shapes might not contain a greatest element). However,
instead of computing (30) directly, we will first derive an expression
that is equal to it (up to a constant and a change of scale), but is
simpler to work with.

In order to derive this expression, we disregard the terms $C_1$ and
$Z_{B_f}/u_\Theta$ that do not depend on $q$ or $v$, disregard the term
$\log P(H)$ that is assumed to be equal for all hypotheses, rearrange
terms, multiply by $u_\Theta$, define
$\lambda \triangleq u_\Theta \lambda'$, and substitute (29) into (30),
to obtain the final expression for the evidence,

$$L(H) \triangleq \sup_{q,v} \left\{ \int_\Theta \left[ q(x)\, \delta_{B_f}(x) + \log P(q(x)|\ell_v(x)) \right] dx + \frac{\lambda}{|J(T)|} \int_\Phi v(X)\, \delta_{B_H}(X)\, dX + \frac{\lambda Z_{B_H}}{|J(T)|} \right\}. \tag{31}$$



In finding the hypothesis $H = (K, T)$ that maximizes this evidence we
are solving the classification problem (because $K$ is estimated) and
the pose estimation problem (because $T$ is estimated). At the same
time, approximations to the segmentation $q$ and the reconstruction $v$
are obtained for the best hypothesis $H$. By construction the
approximation for $v$ is a compromise between "agreeing" with the shape
prior of the estimated class $K$ when this prior is transformed by the
estimated pose $T$, and "explaining" the features observed in the input
image $f$.

As mentioned before, however, computing the evidence $L(H)$ for each
hypothesis $H$ using (31) would be prohibitively expensive, because of
the large number of pixels and voxels that need to be inspected to
compute the integrals in that expression. For this reason we instead
compute bounds for it and use an H&B algorithm to select the best
hypothesis. In the next two sections we describe how to compute those
bounds.

4 Lower bound for L(H) 

In this section we show how to efficiently compute lower bounds for the
evidence $L(H)$ defined in (31). Towards this end we first briefly
review in §4.1 the concept of a mean-summary from [14] and two results
concerning it. Then, in §4.2, we use these results to derive the lower
bound for $L(H)$.

4.1 Review: mean-summaries 

Definition 6 (Mean-summary) Given a BF $B$ defined on a set $\Omega$ and
a partition $\Pi(\Omega)$ of this set, the mean-summary is the
functional $\hat{Y}_B = \{\hat{Y}_{B,\omega}\}_{\omega \in \Pi(\Omega)}$
that assigns to each partition element $\omega \in \Pi(\Omega)$ the
value $\hat{Y}_{B,\omega}$, defined by

$$\hat{Y}_{B,\omega} \triangleq \int_\omega \delta_B(x)\, dx. \tag{32}$$

The name "summary" is motivated by the fact that the "infinite
dimensional" BF is "summarized" by just $n = |\Pi(\Omega)|$ values.

Mean-summaries have two important properties: 1) for certain kinds of
sets $\omega \in \Pi(\Omega)$, the values $\hat{Y}_{B,\omega}$ in the
summary can be computed in constant time, regardless of the "size" of
the sets $\omega$ (using integral images [19], see §10 in the
supplementary material); and 2) they can be used to obtain a lower bound
for the evidence.
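The constant-time property can be illustrated with a standard integral
image (summed-area table). The sketch below assumes that the logit
$\delta_B$ has been sampled on a regular 2D grid and that the partition
elements are axis-aligned rectangles of grid cells; these are
simplifying assumptions, and the construction actually used is the one
described in the supplementary material.

```python
def integral_image(values):
    """Summed-area table with one extra row and column of zeros.
    values is a list of equally long rows of sampled logits."""
    rows, cols = len(values), len(values[0])
    table = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        run = 0.0  # running sum of the current row
        for j in range(cols):
            run += values[i][j]
            table[i + 1][j + 1] = table[i][j + 1] + run
    return table

def box_sum(table, top, left, bottom, right):
    """Sum of values over rows top..bottom-1 and columns left..right-1,
    obtained with four table lookups regardless of the box size."""
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])
```

Multiplying `box_sum` by the measure of one grid cell gives a discrete
approximation of the mean-summary value of the rectangle.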

It can be shown that the BFs that produce a given summary $\hat{Y}$ form
an equivalence class. With an abuse of notation, we will use
$B \sim \hat{Y}$ to denote the fact that a BF $B$ is in the equivalence
class of the summary $\hat{Y}$. Next we prove two results that will be
used to obtain a lower bound for the evidence.

Lemma 1 (Mean-summary identity) Let $\Pi(\Omega)$ be a partition of a
set $\Omega$, let $\hat{s}$ be a discrete shape defined on
$\Pi(\Omega)$, let $B$ be a BF on $\Omega$, and let
$\hat{Y}_B = \{\hat{Y}_{B,\omega}\}_{\omega \in \Pi(\Omega)}$ be the
mean-summary of $B$ in $\Pi(\Omega)$. Then, for any continuous shape
$s \sim \hat{s}$, it holds that

$$\int_\Omega \delta_B(x)\, s(x)\, dx = \sum_{\omega \in \Pi(\Omega)} \hat{s}(\omega)\, \hat{Y}_{B,\omega}. \tag{33}$$

Proof: Immediate from the definitions. □

Lemma 2 (Relationship between the sets of continuous and discrete
shapes) Let $\Pi(\Omega)$ be a partition of a set $\Omega$, let
$\mathbb{S}(\Omega)$ be the set of all continuous shapes in $\Omega$ and
let $\hat{\mathbb{S}}(\Pi(\Omega))$ be the set of all discrete shapes in
$\Pi(\Omega)$. Then the set of continuous shapes that are produced by
any discrete shape in $\hat{\mathbb{S}}(\Pi(\Omega))$,
$\mathbb{S}(\Pi(\Omega)) \triangleq \{s : s \sim \hat{s},\ \hat{s} \in \hat{\mathbb{S}}(\Pi(\Omega))\}$,
is a subset of the set of all continuous shapes in $\Omega$, i.e.,
$\mathbb{S}(\Pi(\Omega)) \subset \mathbb{S}(\Omega)$.

Proof: Immediate from the definitions. □



4.2 Derivation of the lower bound 

In [14] we showed how to compute lower bounds for expressions that were
much simpler than the evidence in (31), by relying on partitions. To
compute bounds for (31) we will also rely on a partition, namely, the
standard partition. Thus, we define next the standard partition of
$(\Theta, \Phi)$ and then proceed to derive the formulas for the bounds.

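Definition 7 (Standard partition) Let $\Theta \subset S^2$,
$\Phi \subset \mathbb{R}^3$, and
$\mathcal{R} \triangleq [R_{min}, R_{max}]$ be three sets such that
$\Phi = \Theta \times \mathcal{R}$ (as in Fig. 1). Let
$\Pi(\Theta) = \{\Theta_1, \ldots, \Theta_{N_\Theta}\}$ be a partition
of $\Theta$ and let
$\Pi(\mathcal{R}) = \{[r_0, r_1), \ldots, [r_{N_r - 1}, r_{N_r})\}$ be a
partition of $\mathcal{R}$ such that

$$R_{min} = r_0 < r_1 < \cdots < r_{N_r} = R_{max} \quad \text{and} \tag{34}$$
$$r_i = \beta\, r_{i-1} = \beta^i r_0, \quad \text{for some } \beta > 1. \tag{35}$$

The standard partition for $(\Theta, \Phi)$ is defined to be
$(\Pi(\Theta), \Pi(\Phi))$, where
$\Pi(\Phi) \triangleq \Pi(\Theta) \times \Pi(\mathcal{R}) = \{\Phi_{1,1}, \Phi_{1,2}, \ldots, \Phi_{N_\Theta, N_r}\}$
with $\Phi_{j,i} \triangleq \Theta_j \times [r_{i-1}, r_i)$ (Fig. 3).
Notice that given an arbitrary partition for the set $\Theta$ and a
particular partition for the set $\mathcal{R}$, the standard partition
defines a partition for the set $\Phi = \Theta \times \mathcal{R}$.

Fig. 3 The standard partition for $(\Theta, \Phi)$, where
$\Phi = \Theta \times \mathcal{R}$,
$\mathcal{R} \triangleq [R_{min}, R_{max}]$,
$\Pi(\Theta) = \{\Theta_1, \Theta_2, \Theta_3, \Theta_4\}$, and
$\Pi(\mathcal{R}) = \{[R_{min}, r_1), [r_1, R_{max})\}$. For clarity only
the voxels $\Phi_{1,1}, \Phi_{2,2}, \Phi_{3,1} \in \Pi(\Phi)$ are shown
($\Phi_{1,1} = \Theta_1 \times [R_{min}, r_1)$,
$\Phi_{2,2} = \Theta_2 \times [r_1, R_{max})$, and
$\Phi_{3,1} = \Theta_3 \times [R_{min}, r_1)$).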
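The geometric spacing of the radii required by (34) and (35) can be
sketched as follows. The function name is hypothetical; the only inputs
are the distance range and the number of radial intervals, from which
the ratio $\beta$ follows.

```python
import math

def radial_partition(r_min, r_max, n_r):
    """Return (beta, [r_0, ..., r_{N_r}]) with r_i = beta**i * r_0,
    r_0 = r_min and r_{N_r} = r_max (requires 0 < r_min < r_max)."""
    beta = (r_max / r_min) ** (1.0 / n_r)
    radii = [r_min * beta ** i for i in range(n_r + 1)]
    radii[-1] = r_max  # remove the rounding error at the outer boundary
    return beta, radii
```

With this spacing every radial interval contributes the same mass
$\log(r_i / r_{i-1}) = \log \beta$ to a ray that crosses it completely.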

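In the next theorem we derive an expression to bound $L(H)$ from below.
The main observation in this theorem is that, according to Lemma 2, the
supremum in (31) for $q \in \mathbb{S}(\Pi(\Theta))$ and
$v \in \mathbb{S}(\Pi(\Phi))$ is less than or equal to the supremum for
$q \in \mathbb{S}(\Theta)$ and $v \in \mathbb{S}(\Phi)$. Moreover, since
the continuous shapes in $\mathbb{S}(\Pi(\Theta))$ and
$\mathbb{S}(\Pi(\Phi))$ are constant inside each partition element,
evaluating this new supremum is easier.

Theorem 1 (Lower bound for L(H)) Let $\Pi = (\Pi(\Theta), \Pi(\Phi))$ be
a standard partition and let
$\hat{Y}_f = \{\hat{Y}_{f,\theta}\}_{\theta \in \Pi(\Theta)}$ and
$\hat{Y}_H = \{\hat{Y}_{H,\phi}\}_{\phi \in \Pi(\Phi)}$ be the
mean-summaries of two unknown BFs in $\Pi(\Theta)$ and $\Pi(\Phi)$,
respectively. Let $\psi_{j,k}$ be the set of the indices of the $k$
largest elements of
$\{\hat{Y}_{H,\Phi_{j,1}}, \hat{Y}_{H,\Phi_{j,2}}, \ldots, \hat{Y}_{H,\Phi_{j,N_r}}\}$,
and let $\Sigma_{j,k}$ be the sum of these elements, i.e.,

$$\Sigma_{j,k} \triangleq \sum_{i \in \psi_{j,k}} \hat{Y}_{H,\Phi_{j,i}}. \tag{36}$$

Then, for any BF $B_f \sim \hat{Y}_f$ and any BF $B_H \sim \hat{Y}_H$,
it holds that $L(H) \ge \underline{L}_\Pi(H)$, where

$$\underline{L}_\Pi(H) \triangleq \frac{\lambda Z_{B_H}}{|J(T)|} + \sum_{j=1}^{N_\Theta} \underline{C}_{\Theta_j}(H), \quad \text{and} \tag{37}$$

$$\underline{C}_{\Theta_j}(H) \triangleq \max_{0 \le n \le N_r} \left\{ \max_{q \in \{0,1\}} \left[ q\, \hat{Y}_{f,\Theta_j} + |\Theta_j| \log P(q \,|\, n \log \beta) \right] + \frac{\lambda}{|J(T)|} \Sigma_{j,n} \right\}. \tag{38}$$

Moreover, the 3D reconstruction and the 2D segmentation corresponding to
this bound are given by the discrete shapes $\hat{v}$ and $\hat{q}$,
respectively, defined by

$$\hat{v}(\Phi_{j,i}) \triangleq \begin{cases} 1, & \text{if } i \in \psi_{j,n^*}, \\ 0, & \text{otherwise,} \end{cases} \tag{39}$$

$$\hat{q}(\Theta_j) \triangleq \arg\max_{q \in \{0,1\}} \left[ q\, \hat{Y}_{f,\Theta_j} + |\Theta_j| \log P(q \,|\, n^* \log \beta) \right], \tag{40}$$

where $n^*$ is the solution to (38).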



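Proof: From Lemma 2 it holds that $L(H)$ (defined in (31)) is greater
than or equal to

$$\frac{\lambda Z_{B_H}}{|J(T)|} + \max_{\hat{q}, \hat{v}}\ \sup_{q \sim \hat{q},\, v \sim \hat{v}} \left\{ \int_\Theta \left[ \delta_{B_f}(x)\, q(x) + \frac{\lambda}{|J(T)|} \int_{R_{min}}^{R_{max}} r^2 v(x, r)\, \delta_{B_H}(x, r)\, dr + \log P(q(x)|\ell_v(x)) \right] dx \right\}. \tag{41}$$

Since $q \sim \hat{q}$ and $v \sim \hat{v}$, it follows from Lemma 1
that

$$\int_\Theta \delta_{B_f}(x)\, q(x)\, dx = \sum_{j=1}^{N_\Theta} \hat{q}(\Theta_j)\, \hat{Y}_{f,\Theta_j}, \quad \text{and} \tag{42}$$

$$\int_\Theta \int_{R_{min}}^{R_{max}} r^2 v(x, r)\, \delta_{B_H}(x, r)\, dr\, dx = \sum_{j=1}^{N_\Theta} \sum_{i=1}^{N_r} \hat{v}(\Phi_{j,i})\, \hat{Y}_{H,\Phi_{j,i}}. \tag{43}$$

On the other hand, $\ell_v(x)$ is constant inside each element of
$\Pi(\Theta)$, because $\forall x \in \Theta_j$,

$$\ell_v(x) = \int_{r_0}^{r_{N_r}} \frac{v(x, r)}{r}\, dr = \sum_{i=1}^{N_r} \hat{v}(\Phi_{j,i}) \log \frac{r_i}{r_{i-1}} = \log(\beta) \sum_{i=1}^{N_r} \hat{v}(\Phi_{j,i}). \tag{44}$$

Then, substituting (42), (43) and (44) into (41), that expression
reduces to

$$\frac{\lambda Z_{B_H}}{|J(T)|} + \max_{\hat{q}, \hat{v}} \sum_{j=1}^{N_\Theta} \left[ \hat{q}(\Theta_j)\, \hat{Y}_{f,\Theta_j} + |\Theta_j| \log P\!\left( \hat{q}(\Theta_j) \,\Big|\, \log(\beta) \sum_{i=1}^{N_r} \hat{v}(\Phi_{j,i}) \right) + \frac{\lambda}{|J(T)|} \sum_{i=1}^{N_r} \hat{v}(\Phi_{j,i})\, \hat{Y}_{H,\Phi_{j,i}} \right]. \tag{45}$$

Note that the first (leftmost) term inside the square brackets in (45)
does not depend on the states of the voxels $\hat{v}(\Phi_{j,i})$, the
third term does not depend on the state of the pixel
$\hat{q}(\Theta_j)$, and the second term does not depend on which
particular voxels are full, but on the number of full voxels
$n_j \triangleq \sum_{i=1}^{N_r} \hat{v}(\Phi_{j,i})$. In contrast, the
third term does depend on which voxels are full, and given the number
$n_j$ of full voxels, the third term will be maximum when the $n_j$
voxels with the largest summary $\hat{Y}_{H,\Phi_{j,i}}$ are full. In
this case the third term takes the value
$\frac{\lambda}{|J(T)|} \Sigma_{j,n_j}$, with $\Sigma_{j,n_j}$ defined
in (36). Therefore (45) is equal to

$$\frac{\lambda Z_{B_H}}{|J(T)|} + \sum_{j=1}^{N_\Theta} \max_{0 \le n_j \le N_r} \left\{ \max_{q \in \{0,1\}} \left[ q\, \hat{Y}_{f,\Theta_j} + |\Theta_j| \log P(q \,|\, n_j \log \beta) \right] + \frac{\lambda}{|J(T)|} \Sigma_{j,n_j} \right\}, \tag{46}$$

which can be rearranged to yield (37) and (38). That the discrete shapes
$\hat{v}$ and $\hat{q}$ (defined respectively in (39) and (40)) maximize
(38) follows immediately, proving the theorem. □

Note that this bound is computed in $O(N_\Theta N_r \log N_r)$.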
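The per-pixel maximization in (38) can be sketched as a sort followed by
a scan over the number $n$ of full voxels, which is where the
$N_r \log N_r$ factor per pixel comes from. The function and parameter
names below are assumptions, `lam_over_jac` stands for
$\lambda / |J(T)|$, and `log_p` evaluates the two cases of (27).

```python
import math

def log_p(q, ell, alpha):
    """log P(q | ell) as in (27)."""
    if q == 0:
        return alpha * ell
    return math.log(-math.expm1(alpha * ell)) if ell > 0.0 else -math.inf

def pixel_lower_bound(y_f, y_h, pixel_area, beta, alpha, lam_over_jac):
    """Return (value, q, full_voxels) maximizing the bracket of (38) for one pixel.
    y_f is the mean-summary of the pixel, y_h the list of mean-summaries of the
    voxels behind it, and full_voxels the indices of the voxels set to 1."""
    order = sorted(range(len(y_h)), key=lambda i: y_h[i], reverse=True)
    best = (-math.inf, 0, [])
    partial = 0.0  # sum of the n largest voxel summaries
    for n in range(len(y_h) + 1):
        if n > 0:
            partial += y_h[order[n - 1]]
        ell = n * math.log(beta)
        for q in (0, 1):
            value = q * y_f + pixel_area * log_p(q, ell, alpha) + lam_over_jac * partial
            if value > best[0]:
                best = (value, q, sorted(order[:n]))
    return best
```

Summing the returned values over all pixels and adding the term
$\lambda Z_{B_H} / |J(T)|$ gives the bound in (37); the returned indices
and state correspond to the discrete shapes in (39) and (40).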



5 Upper bound for L(H) 

In this section we show how to efficiently compute upper bounds for the
evidence $L(H)$ defined in (31). Towards this end we first briefly
review in §5.1 some concepts and results from [14], and then in §5.2 we
use these concepts and results to derive the upper bound for $L(H)$.

5.1 Review: semidiscrete shapes and m-summaries 

In §4 we used discrete shapes and mean-summaries to get a lower bound
for the evidence. Analogously, in this section we define and use the
concepts of semidiscrete shapes and m-summaries to get an upper bound.
Like discrete shapes and mean-summaries, semidiscrete shapes and
m-summaries "condense" the "infinite dimensional" continuous shapes and
BFs, respectively, into a finite set of real numbers.

Definition 8 (Semidiscrete shape) Given a partition $\Pi(\Omega)$ of a
set $\Omega \subset \mathbb{R}^d$, the semidiscrete shape $\tilde{s}$ is
the function $\tilde{s} : \Pi(\Omega) \to \mathbb{R}$ that associates to
each element $\omega \in \Pi(\Omega)$ a value in the interval
$[0, |\omega|]$. Given a continuous shape $s$ and a semidiscrete shape
$\tilde{s}$ we say that both shapes are equivalent (denoted as
$s \sim \tilde{s}$) if
$\tilde{s}(\omega) = |s \cap \omega|\ \forall \omega \in \Pi(\Omega)$
(i.e., if the measure of $s$ in each set $\omega$ is equal to
$\tilde{s}(\omega)$). Informally the semidiscrete shape $\tilde{s}$
"remembers" how much of each partition element is occupied by the shape
$s$, but "forgets" the state $s(x)$ of each particular point
$x \in \omega$.



The following lemma explores the relationship between the sets of
continuous and semidiscrete shapes. It simply states that, for any
partition, every continuous shape is equivalent to some semidiscrete
shape in the partition.

Lemma 3 (Relationship between the sets of continuous and semidiscrete
shapes) Let $\Pi(\Omega)$ be a partition of a set $\Omega$, let
$\mathbb{S}(\Omega)$ be the set of all continuous shapes in $\Omega$ and
let $\tilde{\mathbb{S}}(\Pi(\Omega))$ be the set of all semidiscrete
shapes in $\Pi(\Omega)$. Then

$$\{ s \in \mathbb{S}(\Omega) : s \sim \tilde{s},\ \tilde{s} \in \tilde{\mathbb{S}}(\Pi(\Omega)) \} = \mathbb{S}(\Omega). \tag{47}$$

Proof: Immediate from the definitions. □

Definition 9 (m-summary) Given a BF $B$ defined on a set $\Omega$ and a
partition $\Pi(\Omega)$ of this set, the m-summary is the functional
$\tilde{Y}_B = \{\tilde{Y}_{B,\omega}\}_{\omega \in \Pi(\Omega)}$ that
assigns to each partition element $\omega \in \Pi(\Omega)$ the
$(2m + 1)$-dimensional vector
$\tilde{Y}_{B,\omega} = [\tilde{Y}^{-m}_{B,\omega}, \ldots, \tilde{Y}^{m}_{B,\omega}]$,
whose components are defined by

$$\tilde{Y}^{j}_{B,\omega} \triangleq \left| \left\{ x \in \omega : \delta_B(x) < \frac{j\, \delta_{max}}{m} \right\} \right| \tag{48}$$

for $j = -m, \ldots, m$. In other words, the m-summary element
$\tilde{Y}_{B,\omega}$ "remembers" how the values of $\delta_B$ in the
set $\omega$ are distributed, but "forgets" where those values are
within the set. More specifically, the quantity
$(\tilde{Y}^{j+1}_{B,\omega} - \tilde{Y}^{j}_{B,\omega})$ indicates the
measure of the subset of $\omega$ whose values of $\delta_B$ are in the
interval $[j \delta_{max}/m, (j + 1) \delta_{max}/m)$. Given a BF $B$
and an m-summary $\tilde{Y}$ defined on a partition $\Pi(\Omega)$, we
say that they are equivalent (denoted as $B \sim \tilde{Y}$) if they
satisfy (48) for each set $\omega \in \Pi(\Omega)$. Throughout this work
we use $m = 6$.
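For a partition element whose logits are available as a finite list of
samples, each representing a cell of equal measure, the components in
(48) reduce to scaled counts. This sampled representation and the
function name are assumptions made for illustration.

```python
def m_summary(deltas, cell_measure, delta_max, m=6):
    """Return [Y^{-m}, ..., Y^{m}], where Y^j is the measure of the samples
    whose logit is strictly below j * delta_max / m (Eq. (48))."""
    return [cell_measure * sum(1 for d in deltas if d < j * delta_max / m)
            for j in range(-m, m + 1)]
```

The resulting vector is nondecreasing in $j$, and consecutive differences
give the measure falling in each logit interval.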

M-summaries, like mean-summaries, have two important properties: 1) for
certain kinds of sets $\omega \in \Pi(\Omega)$, the values
$\tilde{Y}_{B,\omega}$ in the summary can be computed in constant time,
regardless of the "size" of the sets $\omega$ (using integral images
[19], see §10 in the supplementary material); and 2) they can be used to
obtain an upper bound for the evidence. Lemma 4, below, will be used to
obtain this upper bound. This lemma tells us how to bound the integral
$\int_\Omega \delta_B(x)\, s(x)\, dx$ when the BF $B$ is only known to
be equivalent to an m-summary $\tilde{Y}$, and the continuous shape $s$
is only known to be equivalent to a semidiscrete shape $\tilde{s}$.

Lemma 4 (m-summary bound) Let $\Pi(\Omega)$ be a partition of a set
$\Omega$, let $\tilde{s}$ be a semidiscrete shape defined on
$\Pi(\Omega)$, and let
$\tilde{Y} = \{\tilde{Y}_\omega\}_{\omega \in \Pi(\Omega)}$ be an
m-summary in $\Pi(\Omega)$. Then for any continuous shape $s$ on
$\Omega$ such that $s \sim \tilde{s}$, it holds that

$$\sup_{B \sim \tilde{Y}} \int_\Omega \delta_B(x)\, s(x)\, dx \le \sum_{\omega \in \Pi(\Omega)} F_{\tilde{Y}_\omega}(\tilde{s}(\omega)), \tag{49}$$

where $B$ is a BF,

$$F_{\tilde{Y}_\omega}(S) \triangleq \frac{\delta_{max}}{m} \left[ \sum_{j=J(S)}^{m-1} (j + 1) \left( \tilde{Y}^{j+1}_\omega - \tilde{Y}^{j}_\omega \right) + J(S) \left( \tilde{Y}^{J(S)}_\omega - |\omega| + S \right) \right], \tag{50}$$

and

$$J(S) \triangleq \min \left\{ j : \tilde{Y}^{j}_\omega \ge |\omega| - S \right\}. \tag{51}$$

Proof: See [14, Lemma 1]. □

Intuitively, the continuous shape $s$ that yields the supremum of the
integral in (49) contains the parts of each $\omega$ where $\delta_B$ is
greatest and has a "mass" of $\tilde{s}(\omega)$ in each
$\omega \in \Pi(\Omega)$. Hence, the supremum within each set $\omega$,
$F_{\tilde{Y}_\omega}(\tilde{s}(\omega))$, is obtained by adding the
"mass" of the subset of $\omega$ where the value of $\delta_B$ is in the
interval $[j \delta_{max}/m, (j + 1) \delta_{max}/m)$, times the maximum
value of $\delta_B$ in this interval, $(j + 1) \delta_{max}/m$. These
terms are added in descending order of $j$ until the total mass
allocated in $\omega$ is $\tilde{s}(\omega)$.

5.2 Upper bound for L(H)

In this subsection we will derive a formula to compute an upper bound
for the evidence $L(H)$ defined in (31). Towards this end recall that
$L(H)$ is computed by solving an optimization problem of the form

$$L(H) = \sup_{q,v} E_1(B_f, B_H, q, v), \tag{52}$$

where $B_f$ and $B_H$ are two BFs obtained from the input image $f$ and
the hypothesis $H$, respectively, and $q$ and $v$ are two continuous
shapes. To derive the bound, we proceed in three steps.

Step 1. We reduce the "amount of information" to be processed in the
computation of $L(H)$ by considering not only the given BFs $B_f$ and
$B_H$, but also all the BFs $B'_f$ and $B'_H$ that have the same
m-summaries $\tilde{Y}_f$ and $\tilde{Y}_H$, respectively. In other
words, $B_f \sim \tilde{Y}_f \sim B'_f$ and
$B_H \sim \tilde{Y}_H \sim B'_H$. In doing so we obtain an upper bound
for $L(H)$,

$$L(H) \le \sup_{q,v} E_2(\tilde{Y}_f, \tilde{Y}_H, q, v), \quad \text{where} \tag{53}$$

$$E_2(\tilde{Y}_f, \tilde{Y}_H, q, v) \triangleq \sup_{B'_f \sim \tilde{Y}_f,\ B'_H \sim \tilde{Y}_H} E_1(B'_f, B'_H, q, v). \tag{54}$$

Therefore, we can disregard the details about each BF and only consider
the information in their m-summaries. Moreover, using Lemma 4 we can
bound the first and third terms in (31) as a function of the
semidiscrete shapes $\tilde{q}$ and $\tilde{v}$ defined on the standard
partition $(\Pi(\Theta), \Pi(\Phi))$.

Step 2. The second term in (31) can also be bounded in terms of the
shapes $\tilde{q}$ and $\tilde{v}$. In Lemma 5 we will show that it is
possible to write in closed form (up to a permutation of the rays within
each partition element) the continuous shapes $q^*$ and $v^*$ that
maximize this term and are respectively equivalent to $\tilde{q}$ and
$\tilde{v}$. We denote this fact as $q^* = f_1(\tilde{q})$ and
$v^* = f_2(\tilde{q}, \tilde{v})$. Therefore, for each pair of
semidiscrete shapes $\tilde{q}$ and $\tilde{v}$ we do not need to
consider all possible continuous shapes $q$ and $v$ such that
$q \sim \tilde{q}$ and $v \sim \tilde{v}$; we only need to consider
$f_1(\tilde{q})$ and $f_2(\tilde{q}, \tilde{v})$. Hence, the
optimization problem in (53) can be simplified to a problem of the form

$$\max_{\tilde{q}, \tilde{v}} E_2(\tilde{Y}_f, \tilde{Y}_H, f_1(\tilde{q}), f_2(\tilde{q}, \tilde{v})), \tag{55}$$

i.e., to estimate a pair of semidiscrete shapes rather than a pair of
continuous shapes. This means that in (55) only a finite number of
quantities needs to be estimated ($\tilde{q}(\Theta_j)$ and
$\tilde{v}(\Phi_{j,i})$, for $j = 1, \ldots, N_\Theta$ and
$i = 1, \ldots, N_r$), whereas in (53) an "infinite number of
quantities" needs to be estimated ($q(x)$ and $v(X)$, for
$x \in \Theta$ and $X \in \Phi$).
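The bound of Lemma 4 used in Step 1 can be evaluated directly from an
m-summary. The sketch below follows (50) and (51) for a single partition
element; the list layout (`summary[j + m]` holds the component with
index $j$) and the function name are assumptions, and the summary is
assumed to satisfy $\tilde{Y}^{m}_\omega = |\omega|$.

```python
def summary_bound(summary, mass, omega_measure, delta_max):
    """F(S) of Eq. (50) for one partition element.
    summary[j + m] holds Y^j for j = -m..m, mass is S in [0, |omega|]."""
    m = (len(summary) - 1) // 2
    # J(S) of Eq. (51): smallest j such that the part of omega whose logit
    # is at least j * delta_max / m has measure at most S.
    J = next((j for j in range(-m, m + 1)
              if summary[j + m] >= omega_measure - mass), m)
    # Bins J..m-1 are used completely, each valued at its upper edge.
    total = sum((j + 1) * (summary[j + 1 + m] - summary[j + m])
                for j in range(J, m))
    # The remaining mass goes to the bin just below, whose upper edge is J.
    total += J * (summary[J + m] - omega_measure + mass)
    return delta_max / m * total
```

For example, with the summary of five unit-measure samples
$\{-3, -0.5, 0, 0.5, 2.9\}$, $\delta_{max} = 3$ and $m = 6$, a mass of 2
gives the bound $3.0 + 1.0 = 4.0$, which is no smaller than the exact
value $2.9 + 0.5 = 3.4$ attained by the two largest samples.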



Step 3. As it turns out, the problem in (55) can be further simplified
because it is possible to efficiently compute the optimal semidiscrete
shape $\tilde{v}^*$ that corresponds to any semidiscrete shape
$\tilde{q}$. We denote this fact as $\tilde{v}^* = f_3(\tilde{q})$.
Therefore the problem in (55) can be simplified into a problem of the
form

$$\max_{\tilde{q}} E_2(\tilde{Y}_f, \tilde{Y}_H, f_1(\tilde{q}), f_2(\tilde{q}, f_3(\tilde{q}))). \tag{56}$$

That is, for each element $\Theta_j$ of the partition $\Pi(\Theta)$, we
need to solve, independently, an optimization over the scalar parameter
$\mathsf{q}_j \triangleq \tilde{q}(\Theta_j)$. Each of these
optimizations is solved using grid search.
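A one-dimensional grid search of the kind mentioned above can be
sketched as follows. The uniform grid, the number of steps and the
function name are assumptions; the objective passed in would be the
per-pixel function of the scalar mass of the pixel.

```python
def grid_search_max(objective, lower, upper, steps=64):
    """Evaluate objective on a uniform grid over [lower, upper] and return
    the grid point with the largest value together with that value."""
    best_x, best_val = lower, objective(lower)
    for k in range(1, steps + 1):
        x = lower + (upper - lower) * k / steps
        val = objective(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val
```

A finer grid trades computation for accuracy; for a quantity used as an
upper bound, the grid resolution has to be accounted for separately,
since a grid maximum can fall below the true maximum.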

The functions $f_1$ and $f_2$ mentioned in Step 2 above are informally
defined in Fig. 4. Given the semidiscrete shape $\tilde{q}$, the
function $f_1$ returns a continuous shape $q^*$ in $\Theta$ such that
$q^* \sim \tilde{q}$. From Lemma 4, any continuous shape in the set
$\{q : q \sim \tilde{q}\}$ is "equally good," providing the same bound
for the first term in (31). This is what we meant by "up to a
permutation of the rays within each partition element."

The function $f_2$ is somewhat more complex. To understand it, we need
to define the sets $\Theta_j^0$, $\Theta_j^1$, $\Phi_{j,i}^0$ and
$\Phi_{j,i}^1$. Given the continuous shape $q^*$ in $\Theta$, the sets
$\Theta_j^0$ and $\Theta_j^1$ are the parts of pixel $\Theta_j$ where
$q^*(x) = 0$ and $q^*(x) = 1$, respectively, and $\Phi_{j,i}^0$ and
$\Phi_{j,i}^1$ are the parts of voxel $\Phi_{j,i}$ that project to
$\Theta_j^0$ and $\Theta_j^1$, respectively (Fig. 4). More formally,

$$\Theta_j^0 \triangleq \{ x \in \Theta_j : q^*(x) = 0 \}, \tag{57}$$
$$\Theta_j^1 \triangleq \{ x \in \Theta_j : q^*(x) = 1 \}, \tag{58}$$
$$\Phi_{j,i}^0 \triangleq \Theta_j^0 \times [r_{i-1}, r_i), \quad \text{and} \tag{59}$$
$$\Phi_{j,i}^1 \triangleq \Theta_j^1 \times [r_{i-1}, r_i). \tag{60}$$

Fig. 4 Continuous shapes $q^*$ (in blue) and $v^*$ (in green) that
maximize (70) when the mass in a pixel $\Theta$ is constrained to be
$\tilde{q}(\Theta)$ and the mass in each voxel $\Phi_i$ is constrained
to be $\tilde{v}(\Phi_i)$ (for convenience the pixel subscript $j$ has
been omitted). $\Theta^0$ and $\Theta^1$ are, respectively, the parts of
the pixel $\Theta$ where $q^*(x) = 0$ and $q^*(x) = 1$. The side of each
voxel that projects to $\Theta^1$, $\Phi_i^1$, is filled first. In this
part the mass is concentrated on the inner side of the voxel, between
$r_{i-1}$ and $\rho_i^1$. Only when this part is full, the other part of
the voxel (i.e., the part that projects to $\Theta^0$) starts to be
filled. On that part of the voxel the mass is concentrated in the outer
part, between $\rho_i^0$ and $r_i$. $\rho_i^1$ and $\rho_i^0$ are
computed using the formulas in (67) and (68).

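The continuous shape $v^*$ returned by $f_2$ has a "mass" of
$\tilde{v}(\Phi_{j,i})$ inside each voxel $\Phi_{j,i}$ (because
$v^* \sim \tilde{v}$). As demonstrated in Lemma 5 below, within each
voxel the mass is preferentially allocated on $\Phi_{j,i}^1$ rather than
on $\Phi_{j,i}^0$. To make this more precise, let us define the
quantities $\mathsf{v}_{j,i}^1$ and $\mathsf{v}_{j,i}^0$ to be the mass
of $v^*$ in $\Phi_{j,i}^1$ and $\Phi_{j,i}^0$, respectively, that is

$$\mathsf{v}_{j,i}^1 \triangleq \left| \{ X \in \Phi_{j,i}^1 : v^*(X) = 1 \} \right|, \quad \text{and} \tag{61}$$
$$\mathsf{v}_{j,i}^0 \triangleq \left| \{ X \in \Phi_{j,i}^0 : v^*(X) = 1 \} \right|. \tag{62}$$

Since the mass is allocated preferentially in $\Phi_{j,i}^1$ rather than
in $\Phi_{j,i}^0$, these quantities can be computed with the formulas

$$\mathsf{v}_{j,i}^1 \triangleq \min\{ \tilde{v}(\Phi_{j,i}),\ |\Phi_{j,i}^1| \} \quad \text{and} \tag{63}$$
$$\mathsf{v}_{j,i}^0 \triangleq \max\{ \tilde{v}(\Phi_{j,i}) - |\Phi_{j,i}^1|,\ 0 \} = \tilde{v}(\Phi_{j,i}) - \mathsf{v}_{j,i}^1. \tag{64}$$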

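It is also proved in Lemma 5 that the mass in $\Phi_{j,i}^1$ lies on the
inner side of the voxel, while the mass in $\Phi_{j,i}^0$ lies on the
outer side of the voxel (Fig. 4). To make this statement more precise,
recall that $\mathsf{q}_j = \tilde{q}(\Theta_j)$ and define the
quantities
$\rho_{j,i}^1 = \rho_{j,i}^1(\mathsf{q}_j, \mathsf{v}_{j,i}^1)$ and
$\rho_{j,i}^0 = \rho_{j,i}^0(\mathsf{q}_j, \mathsf{v}_{j,i}^0)$ to be
the outer radius of the part of $\Phi_{j,i}^1$ that is full, and the
inner radius of the part of $\Phi_{j,i}^0$ that is full, respectively.
In order to compute these quantities, we define the voxel volume
function $\Gamma$. Given a voxel $\phi = \theta \times [\rho_0, \rho_1)$
defined by the Cartesian product between a set $\theta$ in the camera
retina and an interval $[\rho_0, \rho_1)$ in the real line (as in
Definition 7), its volume is given by $\Gamma(|\theta|, \rho_0, \rho_1)$,
where $|\theta|$ is the solid angle on the camera retina subtended by
$\theta$ (which is equal to its measure) and the function $\Gamma$ is
defined by

$$\Gamma(a, \rho_0, \rho_1) \triangleq \frac{a}{3} \left( \rho_1^3 - \rho_0^3 \right). \tag{65}$$

Using this definition the volume of a voxel $\Phi_{j,i}$ in the standard
partition is $\Gamma(|\Theta_j|, r_{i-1}, r_i)$, the volume of the set
$\Phi_{j,i}^k$ ($k \in \{0, 1\}$) is

$$|\Phi_{j,i}^k| = \Gamma(|\Theta_j^k|, r_{i-1}, r_i), \tag{66}$$

and the mass of $v^*$ in $\Phi_{j,i}^1$ and $\Phi_{j,i}^0$ is,
respectively,

$$\mathsf{v}_{j,i}^1 = \Gamma(\mathsf{q}_j, r_{i-1}, \rho_{j,i}^1), \quad \text{and} \tag{67}$$
$$\mathsf{v}_{j,i}^0 = \Gamma(|\Theta_j| - \mathsf{q}_j, \rho_{j,i}^0, r_i). \tag{68}$$

Hence, given $\mathsf{q}_j$ and $\tilde{v}(\Phi_{j,i})$, it holds that
$|\Theta_j^1| = \mathsf{q}_j$, $|\Phi_{j,i}^k|$ can be found using (66),
$\mathsf{v}_{j,i}^1$ and $\mathsf{v}_{j,i}^0$ can be found using (63)
and (64), and $\rho_{j,i}^1$ and $\rho_{j,i}^0$ can be found as the
solutions to (67) and (68).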
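Because (65) is a difference of cubes, equations (67) and (68) can be
solved for the radii in closed form. The following sketch splits the
mass of one voxel according to (63) and (64) and then recovers the two
radii; the function names are hypothetical, and the clamping only guards
against rounding error.

```python
def voxel_volume(a, rho0, rho1):
    """Gamma(a, rho0, rho1) = a * (rho1**3 - rho0**3) / 3, Eq. (65)."""
    return a * (rho1 ** 3 - rho0 ** 3) / 3.0

def split_mass(v_mass, q_j, r_in, r_out):
    """Eqs. (63)-(64): the part of the voxel over Theta^1 is filled first.
    Returns (v0, v1)."""
    v1 = min(v_mass, voxel_volume(q_j, r_in, r_out))
    return v_mass - v1, v1

def fill_radii(v0, v1, q_j, pixel_area, r_in, r_out):
    """Solve Eqs. (67)-(68): rho1 is the outer radius of the inner fill over
    Theta^1, rho0 the inner radius of the outer fill over Theta^0."""
    rho1 = r_in
    if q_j > 0.0:
        rho1 = min(r_in ** 3 + 3.0 * v1 / q_j, r_out ** 3) ** (1.0 / 3.0)
    rest = pixel_area - q_j
    rho0 = r_out
    if rest > 0.0:
        rho0 = max(r_out ** 3 - 3.0 * v0 / rest, r_in ** 3) ** (1.0 / 3.0)
    return rho0, rho1
```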



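The previous statements about $v^*$ are proved in the following lemma.
For notational convenience we group the quantities
$\{\mathsf{v}_{j,i}^0\}$ and $\{\mathsf{v}_{j,i}^1\}$ that project to
the same pixel $\Theta_j$ into the vectors

$$\mathsf{V}_j^0 \triangleq [\mathsf{v}_{j,1}^0, \ldots, \mathsf{v}_{j,N_r}^0] \quad \text{and} \quad \mathsf{V}_j^1 \triangleq [\mathsf{v}_{j,1}^1, \ldots, \mathsf{v}_{j,N_r}^1]. \tag{69}$$

Lemma 5 (Upper bound for the projection term) Let
$(\Pi(\Theta), \Pi(\Phi))$ be a standard partition and let $\tilde{q}$
and $\tilde{v}$ be two semidiscrete shapes in $\Pi(\Theta)$ and
$\Pi(\Phi)$, respectively. Let $q$ and $v$ be two continuous shapes in
$\Theta$ and $\Phi$, respectively, and define

$$\Lambda_\Theta(q, v) \triangleq \int_\Theta \log P(q(x)|\ell_v(x))\, dx, \tag{70}$$

with the integrand of this expression defined as in (27). Then:

1. Any continuous shape $q^*$ such that $q^* \sim \tilde{q}$ and the
continuous shape $v^*$ defined by

$$v^*(X) \triangleq \begin{cases} 1, & \text{if } X \in \bigcup_{j,i} \left( \Theta_j^1 \times [r_{i-1}, \rho_{j,i}^1) \ \cup\ \Theta_j^0 \times [\rho_{j,i}^0, r_i) \right), \\ 0, & \text{otherwise} \end{cases} \tag{71}$$

(Fig. 4), where the quantities $\Theta_j^0$, $\Theta_j^1$, $r_i$,
$\rho_{j,i}^1$ and $\rho_{j,i}^0$ are defined in (57)-(58), (35), (67)
and (68), respectively, are a solution to the problem

$$\tilde{\Lambda}_\Theta(\tilde{q}, \tilde{v}) \triangleq \sup_{q \sim \tilde{q},\ v \sim \tilde{v}} \Lambda_\Theta(q, v). \tag{72}$$

2. The optimal value can be computed as

$$\tilde{\Lambda}_\Theta(\tilde{q}, \tilde{v}) = \sum_{j=1}^{N_\Theta} \left[ (|\Theta_j| - \mathsf{q}_j) \log P\!\left( q = 0 \,|\, \ell^0(\mathsf{q}_j, \mathsf{V}_j^0) \right) + \mathsf{q}_j \log P\!\left( q = 1 \,|\, \ell^1(\mathsf{q}_j, \mathsf{V}_j^1) \right) \right], \tag{73}$$

where

$$\ell^0(\mathsf{q}_j, \mathsf{V}_j^0) \triangleq \sum_{i=1}^{N_r} \log \frac{r_i}{\rho_{j,i}^0}, \quad \text{and} \tag{74}$$

$$\ell^1(\mathsf{q}_j, \mathsf{V}_j^1) \triangleq \sum_{i=1}^{N_r} \log \frac{\rho_{j,i}^1}{r_{i-1}}. \tag{75}$$

Proof: See proof in the appendix. □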



Notice that the quantities $\{\mathsf{q}_j\}$
($j = 1, \ldots, N_\Theta$) simply relabel the quantities in the
semidiscrete shape $\tilde{q}$. Notice also that
$\tilde{v}(\Phi_{j,i}) = \mathsf{v}_{j,i}^0 + \mathsf{v}_{j,i}^1$ for
every element $\Phi_{j,i} \in \Pi(\Phi)$. Therefore, the quantities
$\{\mathsf{v}_{j,i}^0 + \mathsf{v}_{j,i}^1\}$
($j = 1, \ldots, N_\Theta$, $i = 1, \ldots, N_r$) define the
semidiscrete shape $\tilde{v}$. This notation emphasizes the fact that
to estimate the semidiscrete shape $\tilde{q}$ we need to estimate
$N_\Theta$ scalar quantities, and to estimate the semidiscrete shape
$\tilde{v}$ we need to estimate $2 N_\Theta N_r$ scalar quantities.
These quantities are estimated in the process of computing the upper
bound in the next theorem.
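Given the fill radii of every voxel behind a pixel, the mass seen by the
rays through the Background part and through the Foreground part of the
pixel follows from integrating $1/r$ over the filled intervals
$[\rho_{j,i}^0, r_i)$ and $[r_{i-1}, \rho_{j,i}^1)$. The sketch below
makes that computation explicit; the list-based layout and the function
name are assumptions.

```python
import math

def ray_masses(radii, rho0, rho1):
    """Return (ell0, ell1) for one pixel.
    radii = [r_0, ..., r_{N_r}]; rho0[i] and rho1[i] are the fill radii of the
    voxel between radii[i] and radii[i + 1] over Theta^0 and Theta^1."""
    ell0 = sum(math.log(radii[i + 1] / rho0[i]) for i in range(len(rho0)))
    ell1 = sum(math.log(rho1[i] / radii[i]) for i in range(len(rho1)))
    return ell0, ell1
```

An empty voxel has $\rho^0 = r_i$ and $\rho^1 = r_{i-1}$ and contributes
nothing, while a completely full column of voxels yields
$\log(R_{max}/R_{min})$ for both masses.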

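Theorem 2 (Upper bound for L(H)) Let $\Pi = (\Pi(\Theta), \Pi(\Phi))$ be
a standard partition and let
$\tilde{Y}_f = \{\tilde{Y}_{f,\theta}\}_{\theta \in \Pi(\Theta)}$ and
$\tilde{Y}_H = \{\tilde{Y}_{H,\phi}\}_{\phi \in \Pi(\Phi)}$ be the
m-summaries of two unknown BFs in $\Pi(\Theta)$ and $\Pi(\Phi)$,
respectively. Then for any BFs $B_f \sim \tilde{Y}_f$ and
$B_H \sim \tilde{Y}_H$, it holds that $L(H) \le \overline{L}_\Pi(H)$,
where

$$\overline{L}_\Pi(H) \triangleq \frac{\lambda Z_{B_H}}{|J(T)|} + \sum_{j=1}^{N_\Theta} \overline{C}_{\Theta_j}(H), \tag{76}$$

$$\overline{C}_{\Theta_j}(H) \triangleq \max_{0 \le \mathsf{q}_j \le |\Theta_j|} \left[ F_{\tilde{Y}_{f,\Theta_j}}(\mathsf{q}_j) + \tau_j(\mathsf{q}_j) \right], \tag{77}$$

and $\tau_j(\mathsf{q}_j)$ is the solution to the problem

$$\tau_j(\mathsf{q}_j) = \sup_{\mathsf{V}_j^0, \mathsf{V}_j^1} \gamma_j(\mathsf{q}_j, \mathsf{V}_j^0, \mathsf{V}_j^1), \quad \text{subject to: } 0 \le \mathsf{v}_{j,i}^0 \le |\Phi_{j,i}^0|,\ 0 \le \mathsf{v}_{j,i}^1 \le |\Phi_{j,i}^1| \quad (i = 1, \ldots, N_r), \tag{78}$$

with

$$\gamma_j(\mathsf{q}_j, \mathsf{V}_j^0, \mathsf{V}_j^1) \triangleq (|\Theta_j| - \mathsf{q}_j)\, \alpha\, \ell^0(\mathsf{q}_j, \mathsf{V}_j^0) + \mathsf{q}_j \log\left( 1 - e^{\alpha \ell^1(\mathsf{q}_j, \mathsf{V}_j^1)} \right) + \frac{\lambda}{|J(T)|} \sum_{i=1}^{N_r} F_{\tilde{Y}_{H,\Phi_{j,i}}}(\mathsf{v}_{j,i}^0 + \mathsf{v}_{j,i}^1). \tag{79}$$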



Proof: It follows from Lemma ^ and from (53) that 
L{H) is less than or equal to 



XZf 



(73) 



Bh^Yh 

A 



sup 


sup / 






L qr^q 







q{x)SBf (x) + 



/ r'^v 

JRmir^ 



\JiT)\ 
logP(9(x)|4(x)) 



(x, r)<5B„ (x, r) dr+ 



dx 



(80) 



Exchanging the order of the sup and max operations, 



the second term in (|80|) is equal to 

Jbj(x)(7(x) 



max < sup sup 



A 



\J{T)\ 



sup 

BHr^YH 



/ / r'^v{^.,r)5BH{^->^) dr d-^ 

Jo JRmiri 

^logP(^(x)|4(x)) dxj I. 



(81) 



Using Lemma^ the following inequalities are obtained 
for the first and second terms in (81): 



sup 

Bfr^Yf 



sup 



SBf{'x.)q{-K) d:sL 



Ne 



^E^v^/,e,(q.), (82) 



P pRmax 

I I r^'u(x, r)^B/j (x, r) dr dx 

J G J Rmin 

Ne Nr. 

i=l i=l 



< 



(83) 



Since the right-hand sides of (82) and (83) do not depend on $q$ or $v$, these terms can be moved out of the supremum over $q$ and $v$ in (81).

In Lemma 5, on the other hand, we have shown that the last term in (81), referred to as $\Lambda_\Theta(\tilde{q}, \tilde{v})$, can be computed explicitly, and an expression for it is given in (73). Substituting (82), (83) and (73) into (81) and rearranging, we obtain



(84) 



sup 



Nr. 



($\ell^0$ and $\ell^1$ are respectively defined in (74) and (75)), which, using the definitions in (78), (79), and (77), is equal to (76), completing the proof. □

This theorem says that to compute the upper bound in (76) we must maximize the function of the scalar variable $q_j$ defined in (77), for each pixel $\Theta_j \in \Pi(\Theta)$. This function is the sum of two terms. In (50) we provided a formula to efficiently compute the left term. We will show in §?? (in the supplementary material) how to efficiently compute the other term (defined by (78) and (79)).

The semidiscrete 2D shape $\tilde{q}$, encoded by the quantities $\{q_j\}$ estimated in the previous theorem, is the 2D segmentation of the object in the image sphere. Similarly, the semidiscrete 3D shape $\tilde{v}$, encoded by the quantities $\{v^0_{j,i} + v^1_{j,i}\}$ estimated in the previous theorem, is the 3D reconstruction of the object. Note that this segmentation/reconstruction pair corresponds to the upper bound. Recall that another such pair, encoded however by discrete shapes, was obtained jointly with the lower bound in Theorem 1.

This concludes the derivation of the upper bound. In the next section we show how to use this bound and the lower bound in Theorem 1 to solve our problem of interest.



6 Bounding mechanism 

Given a standard partition $(\Pi(\Theta), \Pi(\Phi))$, Theorems 1 and 2 describe how to compute the bounds corresponding to this partition. These theorems, however, do not say how to progressively refine these bounds. In this section we explain how to construct a sequence of progressively finer partitions for a hypothesis that will yield, in turn, a sequence of tighter bounds for the hypothesis.

For each hypothesis $H \in \mathbb{H}$ we define a pair of progressively finer sequences of partitions of $\Theta$ and $\Phi$, $\{\Pi^H_{\Theta,k}\}$ and $\{\Pi^H_{\Phi,k}\}$, respectively. These sequences, in general, are different for different hypotheses. The sequence $\{\Pi^H_{\Theta,k}\}$ is defined inductively by

$$\Pi^H_{\Theta,0} \triangleq \{\theta^H_0\}, \quad \text{and} \tag{85}$$

$$\Pi^H_{\Theta,k+1} \triangleq \big[\Pi^H_{\Theta,k} \setminus \theta^H_k\big] \cup \pi(\theta^H_k) \quad (k \ge 0), \tag{86}$$

where $\theta^H_k \in \Pi^H_{\Theta,k}$, $\pi(\theta^H_k)$ is a partition of $\theta^H_k$, and the set $\theta^H_0 \subset \Theta$ is the smallest axis-aligned rectangle that contains the projection on the camera retina of the support of the hypothesis $H$ (defined in §3.3).

In order to define the partition $\pi(\theta^H_k)$ in (86) we adopt the following rules. When the ratio of $\theta^H_k$'s height over its width is close to 1, $\pi(\theta^H_k)$ consists of 4 (approximately) equal rectangles obtained by splitting $\theta^H_k$ (approximately) in half along each dimension. When this ratio is not close to 1, $\theta^H_k$ is split in such a way as to obtain "rectangles" that will have a ratio closer to 1 in the next iteration.

The sequence $\{\Pi^H_{\Phi,k}\}$, on the other hand, is constructed from the sequence $\{\Pi^H_{\Theta,k}\}$ and the quantities $\{N^H_{r,k}(\theta)\}$ (with $\theta \in \Pi^H_{\Theta,k}$). The quantity $N^H_{r,k}(\theta)$ indicates the number of voxels "behind" the pixel $\theta \in \Pi^H_{\Theta,k}$ at the $k$-th step (note that different pixels in $\Pi^H_{\Theta,k}$ might have different numbers of voxels behind them). This quantity is initialized to 1, and then each time a pixel is subdivided, its number of voxels is doubled. In other terms, if the pixel chosen to be split in the $k$-th refinement cycle is $\theta^H_k$ (see (86)), the number of voxels behind the different pixels is computed using the following recursion:

$$N^H_{r,0}(\theta^H_0) \triangleq 1, \tag{87}$$

$$N^H_{r,k+1}(\theta) \triangleq \begin{cases} 2\, N^H_{r,k}(\theta^H_k), & \text{if } \theta \in \pi(\theta^H_k), \\ N^H_{r,k}(\theta), & \text{otherwise.} \end{cases} \tag{88}$$

Thus, each partition $\Pi^H_{\Phi,k}$ is defined as

$$\Pi^H_{\Phi,k} \triangleq \bigcup_{\theta \in \Pi^H_{\Theta,k}} \Pi_\Phi(\theta), \tag{89}$$

where $\Pi_\Phi(\theta)$, the partition into $N^H_{r,k}(\theta)$ voxels of the region behind the pixel $\theta$, is as defined in (34)-(35).

Let us now define the sequence of partition pairs as $\{\Pi^H_k\} \triangleq \{(\Pi^H_{\Theta,k}, \Pi^H_{\Phi,k})\}$. The first pair in this sequence, $\Pi^H_0$, consists of a single voxel that projects to a single pixel ($\theta^H_0$). Then during each refinement cycle a pixel is split (in general) into four subpixels, and its voxels are split (in general) into eight subvoxels, to generate a new pair $\Pi^H_{k+1}$ of the sequence.

For this new pair, lower and upper bounds for the evidence $L(H)$, respectively $\underline{L}_{\Pi^H_{k+1}}(H)$ and $\overline{L}_{\Pi^H_{k+1}}(H)$, could be computed by adding the $N_\theta = |\Pi^H_{\Theta,k+1}|$ terms in (37) and (76), respectively. However, these bounds can be computed more efficiently by exploiting the form of (86), as

$$\underline{L}_{\Pi^H_{k+1}}(H) = \underline{L}_{\Pi^H_k}(H) - \underline{\mathcal{L}}_{\theta^H_k}(H) + \sum_{\theta \in \pi(\theta^H_k)} \underline{\mathcal{L}}_\theta(H). \tag{90}$$

(A similar expression for the upper bound $\overline{L}_{\Pi^H_{k+1}}(H)$ can be derived.) Since the partition $\pi(\theta^H_k)$ in general contains just 4 sets, only 4 evaluations of $\underline{\mathcal{L}}_\theta$ and $\overline{\mathcal{L}}_\theta$ are required, using (38) and (77) respectively, to compute the new bounds $\underline{L}_{\Pi^H_{k+1}}(H)$ and $\overline{L}_{\Pi^H_{k+1}}(H)$ for $L(H)$.
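The splitting rule and the voxel-count recursion (87)-(88) can be illustrated with a short Python sketch. The rectangle layout `(x0, y0, x1, y1)`, the threshold `ratio_tol`, and the choice of halving only the longer side of an elongated rectangle are assumptions made for illustration; the text only requires that elongated rectangles be split so that their ratio moves closer to 1.

```python
def split_pixel(rect, n_voxels, ratio_tol=1.5):
    """Split a pixel and double the voxel count of each resulting subpixel.

    rect      -- axis-aligned rectangle (x0, y0, x1, y1); hypothetical layout
    n_voxels  -- number of voxels "behind" the pixel before the split
    ratio_tol -- assumed threshold for "ratio close to 1"
    Returns a list of (subrect, voxel_count) pairs.
    """
    x0, y0, x1, y1 = rect
    width, height = x1 - x0, y1 - y0
    xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    ratio = height / width
    if 1.0 / ratio_tol <= ratio <= ratio_tol:
        # Roughly square: split in half along each dimension (4 subpixels).
        subrects = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                    (x0, ym, xm, y1), (xm, ym, x1, y1)]
    elif ratio > ratio_tol:
        # Tall rectangle: halve the height only, moving the ratio toward 1.
        subrects = [(x0, y0, x1, ym), (x0, ym, x1, y1)]
    else:
        # Wide rectangle: halve the width only.
        subrects = [(x0, y0, xm, y1), (xm, y0, x1, y1)]
    # Recursion (88): every subpixel of the split pixel gets twice its voxels.
    return [(sub, 2 * n_voxels) for sub in subrects]
```

Pixels that are not split keep their previous voxel count, which corresponds to the "otherwise" branch of the recursion.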
However, since the number of voxels in a pixel is doubled each time a pixel is subdivided, the cost of computing $\underline{\mathcal{L}}_\theta$ and $\overline{\mathcal{L}}_\theta$ correspondingly increases. To avoid this, we do not subdivide uniform voxels, defined as those where the function $\delta_{B_H}$ is uniform in the voxel. Moreover, we join consecutive uniform voxels where $\delta_{B_H}$ takes the same value. In this way the number of voxels in a ray does not grow unboundedly, but rather it remains on the order of the number of consecutive uniform regions in a ray (typically three, namely: empty space, object, empty space). Consequently, the cost of computing $\underline{\mathcal{L}}_\theta$ and $\overline{\mathcal{L}}_\theta$ is also bounded, as is the total cost of a refinement cycle.
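The joining of consecutive uniform voxels can be sketched as a single pass over the voxels of a ray. The representation of a voxel as a tuple `(r_start, r_end, value)`, with `value` set to `None` for non-uniform voxels, is a hypothetical choice made for this illustration and not the data structure of the actual implementation.

```python
def merge_uniform_voxels(ray):
    """Join consecutive uniform voxels that share the same value.

    ray -- list of (r_start, r_end, value) ordered along the ray; `value`
           is the constant taken inside a uniform voxel, or None when the
           voxel is not uniform (hypothetical encoding).
    """
    merged = []
    for r_start, r_end, value in ray:
        if merged and value is not None and merged[-1][2] == value:
            # Extend the previous uniform voxel instead of adding a new one.
            prev_start, _, _ = merged[-1]
            merged[-1] = (prev_start, r_end, value)
        else:
            merged.append((r_start, r_end, value))
    return merged
```

After merging, a ray that crosses empty space, the object and empty space again is described by about three voxels, regardless of how many times its pixel has been subdivided.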



While any choice of $\theta^H_k$ from $\Pi^H_{\Theta,k}$ in (86) would result in a new partition pair $\Pi^H_{k+1}$ that is finer than $\Pi^H_k$, it is natural to choose $\theta^H_k$ to be the set in $\Pi^H_{\Theta,k}$ with the greatest local margin $(\overline{\mathcal{L}}_{\theta}(H) - \underline{\mathcal{L}}_{\theta}(H))$, since this is the set responsible for the largest contribution to the total margin of the hypothesis $(\overline{L}_{\Pi^H_k}(H) - \underline{L}_{\Pi^H_k}(H))$.

In order to efficiently find the set $\theta^H_k$ with the greatest local margin, we store the elements of the partition $\Pi^H_{\Theta,k}$ in a priority queue, using their local margin as the priority (a different priority queue is used for each hypothesis). For further details about the refinement of
the bounds see p^, Section 6.3]. 
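A minimal Python sketch of one refinement cycle follows, combining the priority queue with the incremental update (90). The class name `HypothesisBounds`, the callable `local_bounds` (standing in for the evaluations of (38) and (77)) and the callable `split` (standing in for the partition $\pi(\theta^H_k)$) are all hypothetical names introduced here for illustration.

```python
import heapq
import itertools


class HypothesisBounds:
    """Bounds of one hypothesis, refined by splitting the widest-margin pixel."""

    def __init__(self, root_pixel, local_bounds):
        # local_bounds(pixel) -> (lower, upper); placeholder for (38) and (77).
        self._local_bounds = local_bounds
        self._counter = itertools.count()  # tie-breaker for equal margins
        lower, upper = local_bounds(root_pixel)
        self.lower, self.upper = lower, upper
        # heapq is a min-heap, so the margin is negated to pop the largest first.
        self._queue = [(-(upper - lower), next(self._counter),
                        root_pixel, lower, upper)]

    def refine(self, split):
        """Run one refinement cycle; split(pixel) returns the subpixels."""
        _, _, pixel, lower, upper = heapq.heappop(self._queue)
        # Incremental update as in (90): remove the old local terms ...
        self.lower -= lower
        self.upper -= upper
        for sub in split(pixel):
            sub_lower, sub_upper = self._local_bounds(sub)
            # ... and add the terms of the new subpixels.
            self.lower += sub_lower
            self.upper += sub_upper
            heapq.heappush(self._queue, (-(sub_upper - sub_lower),
                                         next(self._counter),
                                         sub, sub_lower, sub_upper))
```

Each cycle costs one pop and a handful of pushes, so the pixel with the greatest local margin is found without scanning the whole partition.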

This concludes the description of the BM used to solve our problem. Before presenting in §8 the results obtained with this BM integrated in an H&B algorithm, we review next the steps of the proposed approach.



7 Summary of the proposed method 

Having completed the description of all the components of the proposed approach, we now summarize the steps involved in its execution. First, during the initialization stage, the bounds corresponding to all the hypotheses are initialized. For this purpose, for each hypothesis $H \in \mathbb{H}$, the following steps are performed: 1) the set $\theta^H_0$ defined in (85) is estimated; 2) the lower and upper bounds corresponding to this set, $\underline{L}(H)$ and $\overline{L}(H)$, respectively, are computed using (37)-(38) and (76)-(77), respectively; and 3) the set $\theta^H_0$ is inserted in an empty priority queue $\Pi^H_\Theta$ using the margin $\overline{L}(H) - \underline{L}(H)$ as the priority.

Then, during each cycle of the refinement stage, the following steps are performed: 1) a single hypothesis $H$ is selected for refinement (as mentioned in §1.5); 2) the pixel $\theta^H_k$ with the largest margin is extracted from the priority queue $\Pi^H_\Theta$; 3) this pixel $\theta^H_k$ is divided into the subpixels $\pi(\theta^H_k)$; 4) the bounds $\underline{\mathcal{L}}_\theta$ and $\overline{\mathcal{L}}_\theta$ are computed for each subpixel $\theta \in \pi(\theta^H_k)$, using (38) and (77) respectively; 5) each set $\theta \in \pi(\theta^H_k)$ is inserted in the priority queue $\Pi^H_\Theta$ using the local margin $\overline{\mathcal{L}}_\theta - \underline{\mathcal{L}}_\theta$ as the priority; and lastly 6) the bounds corresponding to the hypothesis $H$ are updated according to (90) using the subpixel bounds $\underline{\mathcal{L}}_\theta$ and $\overline{\mathcal{L}}_\theta$. The refinement stage concludes when a hypothesis is proved optimal, or a set of hypotheses is proved indistinguishable.

8 Experimental results

In this section we show results obtained with the framework described in previous sections. To illustrate the process of discarding hypotheses we first show experiments on a single image (§8.1) and then present a quantitative analysis of the results obtained on a dataset containing multiple images (§8.2).

8.1 Experiments on a single image

In order to highlight several of the method's unique characteristics, we first describe the results of three experiments. In these experiments a known object is present in the input image and our goal is to estimate its pose (its class is known in this case). For this purpose we define the hypothesis spaces $\mathbb{H}_1$ (used in Experiment 1) and $\mathbb{H}_2$ (used in Experiments 2 and 3) and use our framework to select the hypothesis $H \in \mathbb{H}_i$ $(i = 1, 2)$ that maximizes the evidence $L(H)$.

As mentioned before, a hypothesis consists of an object class $K$ and a transformation (or pose) $T$. Hence, the sets $\mathbb{H}_1$ and $\mathbb{H}_2$ are defined as $\mathbb{H}_i \triangleq \{(K_{object}, T) : T \in \mathbb{T}_i\}$ $(i = 1, 2)$, where $K_{object}$ denotes the class comprising only the known object (i.e., there is no uncertainty in the shape prior), and the $\mathbb{T}_i$'s are sets of transformations containing different horizontal translations. To formally define these sets, let us denote by $\mathbf{i}$, $\mathbf{j}$ and $\mathbf{k}$ the vectors that are 1cm long in the direction of the x, y and z axes (Fig. 5a), respectively, and define the transformation

$$T_{t_x t_y}(X) \triangleq X + t_x \mathbf{i} + t_y \mathbf{j}. \tag{91}$$

The sets $\mathbb{T}_1$ and $\mathbb{T}_2$ are then defined as $\mathbb{T}_1 \triangleq \{T_{t_x t_y} : t_x \in \{0 : 3.2 : 9.6\},\ t_y = 0\}$ and $\mathbb{T}_2 \triangleq \{T_{t_x t_y} : t_x \in \{-15 : 0.5 : 15\},\ t_y \in \{-35 : 0.5 : 20\}\}$, where $\{E_L : \Delta : E_R\}$ denotes the set $\{E_L + k\Delta : k \in \mathbb{N},\ 0 \le k \le (E_R - E_L)/\Delta\}$. In other words, $\mathbb{H}_1$ contains four equispaced hypotheses along the x direction (depicted in Fig. 5b-e) and $\mathbb{H}_2$ contains 6,771 hypotheses produced by combining 61 translations in the x direction and 111 translations in the y direction. The transformations must be interpreted as "moving" the object away from the ground truth position (hence, the method should ideally select the hypothesis corresponding to the identity transformation).

Fig. 5 (a) Coordinate axes in the WCS. Each axis starts at the origin and is 10cm long. (b-e) The four hypotheses in $\mathbb{H}_1$ proposed to "explain" the (same) input image. The support of each hypothesis is indicated by the overlaid 3D box. Only a part of the image is shown; the actual image is larger. All images used in this work contain 640 x 480 pixels.

Fig. 6 (a) Evolution of the bounds of the four hypotheses in Fig. 5b-e. The bounds of one hypothesis are represented by the two hues of the same color, which is also the color of the hypothesis in Fig. 5. Cycles in which a hypothesis is selected for refinement or discarded are indicated by the markers 'o' and 'x', respectively. Initialization cycles (on or before cycle 0) are indicated by the gray background. The blue and cyan hypotheses are discarded during the 1st refinement cycle. The green hypothesis is discarded during the 22nd cycle, proving then that the red hypothesis is optimal. (b) Number of voxels processed, $N^H_{r,k}(\theta^H_k)$, for each pixel $\theta^H_k$ processed (see
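The set notation $\{E_L : \Delta : E_R\}$ and the resulting sizes of the two hypothesis spaces can be checked with a few lines of Python. The helper name `grid` and the small tolerance added before taking the floor (to absorb floating-point rounding in ratios such as 9.6 / 3.2) are assumptions of this sketch.

```python
import math


def grid(e_left, step, e_right):
    """Return {e_left + k*step : k = 0, 1, ..., floor((e_right - e_left)/step)}."""
    # The tolerance guards against ratios such as 9.6 / 3.2 evaluating
    # slightly below the integer 3 in floating point.
    k_max = int(math.floor((e_right - e_left) / step + 1e-9))
    return [e_left + k * step for k in range(k_max + 1)]


# Translations of the first hypothesis space (t_y fixed to 0).
tx_1 = grid(0.0, 3.2, 9.6)

# Translations of the second hypothesis space.
tx_2 = grid(-15.0, 0.5, 15.0)
ty_2 = grid(-35.0, 0.5, 20.0)

size_h1 = len(tx_1)
size_h2 = len(tx_2) * len(ty_2)
```

With these definitions `size_h1` is 4 and `size_h2` is 61 x 111 = 6,771, matching the counts given in the text.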



Experiment 1. When the proposed method was applied to a noiseless BF with the hypotheses in $\mathbb{H}_1$, the bounds corresponding to these hypotheses evolved as depicted in Fig. 6a. In this case one hypothesis was proven to be optimal after 22 refinement cycles. Fig. 6b shows the number of voxels processed during each computation of bounds (i.e., each time (38) and (77) were evaluated). Since elements in the partition $\Pi^H_{\Theta,k}$ are split (in general) in 4 during each refinement cycle (as explained in §6), the number of bound computations is roughly four times the number of refinement cycles. It can be seen that as the bounds of one hypothesis $H$ are refined, the elements of $\Pi^H_{\Theta,k}$ become smaller and tend to initially have a larger number of voxels projecting to them. As explained in §6, the number of voxels initially doubles (steps 0-11 in the figure), and then it grows more slowly or even decreases (steps 12 and after) whenever we merge uniform voxels. For these reasons refinement cycles become more costly up to a point, and then their cost plateaus.



Fig. 7 shows the state of the partitions $\Pi^{H_1}_{\Theta,end}, \dots, \Pi^{H_4}_{\Theta,end}$ corresponding to the four hypotheses in $\mathbb{H}_1$ when the process terminated. These partitions are sufficient to discriminate between the hypotheses, and the additional available resolution does not influence the computational load. Thus the computational load depends on the task at hand (through the set of hypotheses to be discriminated) and not on the resolution of the input image.




Fig. 7 Partitions obtained after 22 refinement cycles. These partitions are sufficient to find the best hypothesis among the four defined in Fig. 5. The color of each partition element represents the margin of the element divided by its area. The silhouettes of the object in the input image are displayed for reference only. Note that the blue and cyan hypotheses are discarded by looking at a single partition element (and computing a single pair of bounds).

Experiment 2. Fig. 8 shows the final bounds obtained for the best hypotheses in $\mathbb{H}_2$ when our method was used to compute the bounds for this set. Note that at termination time there are still 19 hypotheses in the active set $A$ that cannot be further refined (recall that a hypothesis is in $A$ if its upper bound is greater than the maximum lower bound). These hypotheses are indistinguishable given the current input (as defined in §1.5) and will be referred to as solutions. Three solutions are depicted in Fig. 9a: the ground truth solution; the best solution (i.e., the one having the greatest upper bound); and the solution farthest away from the ground truth.
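The membership test for the active set can be written in a few lines. The following Python sketch assumes the bounds are kept in a dictionary mapping each hypothesis to a `(lower, upper)` pair, which is a hypothetical layout; it also keeps ties (using `>=` rather than `>`), an assumption made here so that a hypothesis whose own bounds have collapsed onto the maximum lower bound is not discarded.

```python
def active_set(bounds):
    """Return the hypotheses that cannot yet be discarded.

    bounds -- dict mapping hypothesis -> (lower_bound, upper_bound)
    """
    best_lower = max(lower for lower, _ in bounds.values())
    # A hypothesis is discarded once its upper bound falls below the
    # largest lower bound among all hypotheses.
    return {h for h, (_, upper) in bounds.items() if upper >= best_lower}
```

When the set returned contains a single hypothesis, that hypothesis has been proved optimal; when it contains several that cannot be refined further, they are the solutions discussed above.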

In order to quantify the quality of the set of solu- 
tions A we define for each parameter t of the transfor- 
mations the bias of t and the standard deviation of t, 
respectively as 



CTt 




and 



(92) 



(93) 



where $t_H$ is the value of the parameter $t$ corresponding to the hypothesis $H \in A$ and $t_{true}$ is the true value of the parameter $t$. The values of these quantities obtained for this experiment are summarized in Table 1.

Fig. 8 (a) Final bounds for the subset of the best hypotheses. The number of refinement cycles allocated to each hypothesis is indicated above each hypothesis. (b) The translation corresponding to each solution. Hypotheses can be identified by their color and marker. The ground truth is indicated by the red arrow.

Table 1 Pose estimation errors for a known object in a noiseless image.

$\mu_{t_x}$ (cm)   $\mu_{t_y}$ (cm)   $\sigma_{t_x}$ (cm)   $\sigma_{t_y}$ (cm)   $|A|$
-0.24              0.61               0.34                  1.54                  19

In the ideal case the indistinguishability of a group of hypotheses is a consequence solely of the resolution of the input image. In practice, however, other factors enter into play, including the inaccuracy of the camera model and calibration, the noise corrupting the input image, the fact that $m$ (in the m-summaries) is finite, and the approximation in the computation of the summaries (explained in §10 in the supplementary material).

Fig. 9 (a) Three of the hypotheses in $\mathbb{H}_2$ that could not be distinguished given the current input: the ground truth, the best hypothesis (having the greatest upper bound, marker o), and the one farthest away from the ground truth. (b) The final partition obtained for the best hypothesis (the one with the highest upper bound, indicated by the marker o in Fig. 8). Colors indicate the margin of each partition element. Note that most of the work is performed around the edges of the image or the prior, and that uniform pixels/voxels are not further subdivided.

Fig. 10 Number of active hypotheses (red), mean error (blue) and standard deviation (green) of the active set vs. refinement cycles performed.

As the bounds are refined some hypotheses are discarded, while others remain in the active set. Fig. 10 shows the number of active hypotheses remaining after each refinement cycle, as well as the mean error and the standard deviation ($\sqrt{\mu_{t_x}^2 + \mu_{t_y}^2}$ and $\sqrt{\sigma_{t_x}^2 + \sigma_{t_y}^2}$, respectively) of the active set after each refinement cycle. Note that both the number of hypotheses in the active set and its standard deviation are non-increasing functions. The mean error of the active set, however, increases at times because the hypotheses "on one side" or "on the other side" of the ground truth are not discarded at exactly the same time.
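As a worked example, the two combined quantities can be evaluated for the final active set using the per-axis values reported in Table 1. The Python sketch below only illustrates the two square-root formulas; the variable names are introduced here.

```python
import math

# Per-axis values at termination, taken from Table 1 (in cm).
mu_tx, mu_ty = -0.24, 0.61
sigma_tx, sigma_ty = 0.34, 1.54

# Combined mean error and standard deviation of the active set.
mean_error = math.hypot(mu_tx, mu_ty)        # sqrt(mu_tx^2 + mu_ty^2)
std_dev = math.hypot(sigma_tx, sigma_ty)     # sqrt(sigma_tx^2 + sigma_ty^2)
```

Both values come out dominated by the y component (roughly 0.66 cm and 1.58 cm, respectively), which is consistent with the larger uncertainty along the line of sight discussed below.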

Fig. 9b shows the final partition obtained for the best hypothesis. Note that the partition is finest in the area around the edges of the bottle, and coarsest in the area "outside" the bottle. This behavior emerges automatically (i.e., it does not have to be explicitly encoded in the framework) as the algorithm greedily reduces the uncertainty of each hypothesis by subdividing the partition elements with the greatest margin. Note also that the partition inside the silhouette of the bottle is finer than outside of it. This is because pixels "inside" the bottle, even if they are not near the edges, still have to be divided in order to divide their voxels and obtain an accurate reconstruction (since in the current implementation voxels are divided in depth only when the pixel they project to is also divided).

Fig. 11 shows how the computation is distributed among the hypotheses in $\mathbb{H}_2$. It can be seen in Fig. 11a that most hypotheses (92.2%) only require 0/1 refinement cycles, while only a few (0.41%) had to be processed at the finest resolution. (The exact number of refinement cycles allocated to each hypothesis in the set $A$ is indicated above each hypothesis in Fig. 8a.) Fig. 11b shows that the hypotheses that require most computation surround the ground truth, although not isotropically: hypotheses in the ground truth's line of sight are harder to distinguish from it (compare $\sigma_{t_x}$ and $\sigma_{t_y}$ in Table 1 and look at the shape of the active set in Fig. 8b). Hence, most hypotheses are discarded with minimal computation while only the most promising hypotheses gather most of the computation. This is the source of the computational gain of our algorithm.

Fig. 11 (a) Histogram of refinement cycles allocated per hypothesis. Most computation is allocated to a few hypotheses. (b) Refinement cycles allocated to each hypothesis in the hypothesis space $\mathbb{H}_2$ (note the logarithmic scale). The position of the true hypothesis is indicated by the marker 'x'. The hypotheses gathering most computation surround the true hypothesis.

It is difficult to accurately quantify this computational gain in general, since it strongly depends on the position of the object in the input image, the level and type of noise affecting this image (to be discussed later in this section), the task at hand (through the set of hypotheses that must be discriminated) and the shape priors used. However, for illustration purposes only, it is possible to quantify this gain for the current experiment by comparing the number of voxels processed by our approach with that of a naive approach defined as follows. Suppose that to select the best hypothesis we directly compute the evidence $L(H)$, using (31), for each one of the 6,771 hypotheses in $\mathbb{H}_2$. Note that this entails processing, for each hypothesis, all the pixels and voxels in the relevant parts of the image and world space (i.e., those presumed to contain the object). In this particular example the relevant part of the image contains approximately 50k pixels, and the relevant part of the world contains approximately 12.8M voxels (= 50k pixels x 256 radii). Hence the naive approach would need to process approximately 339M pixels (= 6,771 x 50k) and 86.7G voxels (= 6,771 x 50k x 256).

In contrast, in the proposed approach only 440k pixels (i.e., elements of $\Pi(\Theta)$) and 3.6M voxels (i.e., elements of $\Pi(\Phi)$) are processed. This is a 770-fold reduction of the number of pixels processed, and a 24k-fold reduction of the number of voxels processed. On the other hand, if the accuracy given by the set of hypotheses $\mathbb{H}_1$ is sufficient for a particular application, our method would only need to process 90 pixels and 437 voxels (these voxels can be directly counted in Fig. 6b). This yields a 4M-fold reduction in the number of pixels to process and a 200M-fold reduction in the number of voxels to process, a significant efficiency gain. For this reason we said that the computation depends on the task, in particular on the precision (in the class or pose) required by the task. Moreover, to obtain this gain it is not necessary to down-sample the input image a priori when the task might not even be defined yet; the framework automatically uses the appropriate resolution. Interestingly, the pixels and voxels processed for one hypothesis in the naive approach are all disjoint, while those processed in the proposed approach are not: pixels and voxels processed later lie within pixels and voxels processed earlier.
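The reduction factors quoted above follow directly from the counts given in the text, as the following Python sketch shows. The constant and function names are introduced here for illustration only; the numbers are those stated in the two preceding paragraphs.

```python
# Naive approach: every hypothesis processes all relevant pixels and voxels.
N_HYPOTHESES = 6771
PIXELS_PER_HYPOTHESIS = 50_000   # approximate size of the relevant image part
RADII_PER_PIXEL = 256            # voxels along the ray behind each pixel

naive_pixels = N_HYPOTHESES * PIXELS_PER_HYPOTHESIS
naive_voxels = naive_pixels * RADII_PER_PIXEL


def reduction(naive_count, processed_count):
    """Factor by which the proposed approach reduces the work."""
    return naive_count / processed_count


# Second hypothesis space: 440k pixels and 3.6M voxels processed.
pixel_gain_h2 = reduction(naive_pixels, 440_000)
voxel_gain_h2 = reduction(naive_voxels, 3_600_000)

# First hypothesis space: 90 pixels and 437 voxels processed.
pixel_gain_h1 = reduction(naive_pixels, 90)
voxel_gain_h1 = reduction(naive_voxels, 437)
```

The four ratios evaluate to roughly 770, 24k, 3.8M and 198M, which the text rounds to 770, 24k, 4M and 200M.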

Recall that two 3D shapes are obtained while computing the bounds of a hypothesis: a discrete 3D shape $\hat{v}$ and a semidiscrete 3D shape $\tilde{v}$ are obtained when the lower and upper bounds are computed using (39) and (78), respectively. These shapes are progressively refined as the bounds are refined (Fig. 12). These shapes are initially defined on a partition containing a single voxel (left column), which is then refined to contain hundreds of thousands of voxels after 5,000 refinement cycles (right column). It can be seen that the two shapes $\hat{v}$ and $\tilde{v}$ get progressively "closer to each other" and approach the continuous shape $v$ that solves (31). In §8.2 we will show reconstructions obtained when neither the object nor its class are known.

Fig. 12 Different stages of the reconstructions obtained while computing the bounds of the best hypothesis. These reconstructions are given by the discrete shape $\hat{v}$ (1st and 2nd rows) and the semidiscrete shape $\tilde{v}$ (3rd and 4th rows). Each column contains two renderings of each of these shapes after (from left to right) 1, 10, 100, 1,000 and 5,000 refinement cycles. Each rendering was obtained from a different viewpoint. In one rendering the camera was located in the same pose as in the original input image (1st and 3rd rows), while in the other the rendering camera was rotated 90° to the left of the object (2nd and 4th rows). The triangles on the floor point in both cases towards the original camera. In the case of $\tilde{v}$ the transparency of each voxel indicates the fraction of the voxel that is full, i.e., $\tilde{v}(\Phi_{j,i})/|\Phi_{j,i}|$. A perfectly transparent voxel indicates 0%, while a perfectly opaque voxel indicates 100%. Shadows and reflections were added for visualization purposes only.

Experiment 3. Experiments 1 and 2 assumed that 
a noiseless BF with the silhouette of the object can 
be obtained from the input image. In real scenarios, 
however, this is rarely the case. Camera noise, shadows 
and reflections, among many factors, cause the resulting 
BF to rarely be the exact silhouette of an object, but 
rather to have errors. 

In order to quantify the effect of noise on the performance of our framework we look at how different kinds and levels of noise affect the quality of the pose estimation. For this purpose we run our approach exactly as in Experiment 2, except that we degrade the image with noise. For simplicity and to be able to precisely control the amount of noise introduced, we add synthetic noise directly to the BF corresponding to the ground truth segmentation of the input image (Fig. 13b), rather than to the RGB input image itself (Fig. 13a).

Three kinds of noise have been considered: 1) salt and pepper noise (Fig. 13c), $\mathcal{SP}(P)$, produced by changing, with probability $P$, the success rate of a pixel $\mathbf{x}$ from $p(\mathbf{x})$ to $1 - p(\mathbf{x})$; 2) structured noise (Fig. 13d), $\mathcal{S}(\ell)$, produced by changing the success rate from $p(\mathbf{x})$ to $1 - p(\mathbf{x})$ for each pixel $\mathbf{x}$ in rows and columns that are multiples of $\ell$; and 3) additive, zero mean, white Gaussian noise with standard deviation $\sigma$ (Fig. 13e), denoted by $\mathcal{N}(0, \sigma^2)$. When adding Gaussian noise to a BF some values end up outside the interval [0, 1]. In such cases we trim these values to the corresponding extreme of the interval.
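The three noise models can be sketched in Python using only the standard library. The representation of a BF as a list of rows of success rates, the zero-based row and column indexing, and the function names are assumptions of this illustration; a pixel lying on both a selected row and a selected column is flipped once.

```python
import random


def salt_and_pepper(bf, prob, rng):
    """Flip p(x) to 1 - p(x) independently with probability `prob`."""
    return [[1.0 - p if rng.random() < prob else p for p in row]
            for row in bf]


def structured(bf, period):
    """Flip p(x) to 1 - p(x) on rows and columns that are multiples of `period`.

    Rows and columns are indexed from zero (an assumption of this sketch).
    """
    return [[1.0 - p if (r % period == 0 or c % period == 0) else p
             for c, p in enumerate(row)]
            for r, row in enumerate(bf)]


def gaussian(bf, sigma, rng):
    """Add zero-mean Gaussian noise and trim the result to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in bf]
```

A seeded generator such as `random.Random(0)` can be passed as `rng` to make the corrupted BFs reproducible across the repeated runs over which the results are averaged.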

In addition to these types of noise, to simulate a more realistic scenario, we also consider BFs produced by a simple background subtraction algorithm (as described by (14)). The distribution of features for a pixel $\mathbf{x}$ in the Background, $p_{\mathbf{x}}(f(\mathbf{x}) \mid q(\mathbf{x}) = 0)$, is chosen to be a Gaussian probability density function whose mean and variance are learned from an image of the scene without the object (one such density is learned for each pixel $\mathbf{x}$). The distribution of colors for pixels in the Foreground, $p(f(\mathbf{x}) \mid q(\mathbf{x}) = 1)$, is represented by a mixture of Gaussians whose parameters are learned from a few pixels on the object that we manually select (only one density is learned for all foreground pixels). The details of the background subtraction algorithm are not important here. Our algorithm uses prior geometric 3D information to improve any BF (or segmentation) obtained with any algorithm, as long as it is given as a foreground probability map, i.e., a 2D BF. In fact, the BFs used in the following experiments purposefully contain artifacts to resemble realistic scenarios. A subset of these BFs is shown in the first row of figures 16, 20 and 21.

Fig. 13 Different input images considered in this work. (a) Original RGB image (only part shown). (b) Corresponding "ground truth" BF. (c-e) Ground truth BF corrupted with salt and pepper noise, $\mathcal{SP}(0.1)$, structured noise, $\mathcal{S}(20)$, and Gaussian noise, $\mathcal{N}(0, 0.2^2)$, respectively. (f) BF obtained by background subtraction.

The results of these experiments are summarized in Table 2. To reduce the variation in the results produced by the variation in the noise itself, in the table we report the average of each quantity over 10 runs of the algorithm. For convenience the results of Experiment 2 (the noiseless case) are also included in this table.



Table 2 Pose estimation errors for a known object in a noisy image.

Noise                        $\mu_{t_x}$ (cm)   $\mu_{t_y}$ (cm)   $\sigma_{t_x}$ (cm)   $\sigma_{t_y}$ (cm)   $|A|$
No noise                     -0.24              0.61               0.34                  1.54                  19.0
$\mathcal{SP}(0.05)$         -0.17              -1.25              0.29                  2.08                  17.1
$\mathcal{SP}(0.10)$         0.00               -4.66              0.00                  4.74                  5.7
$\mathcal{S}(40)$            -0.20              -1.24              0.36                  2.16                  13.0
$\mathcal{S}(20)$            -0.14              -4.28              0.16                  4.46                  6.2
$\mathcal{N}(0, 0.10^2)$     -0.25              0.03               0.35                  2.43                  33.1
$\mathcal{N}(0, 0.20^2)$     -0.21              -0.51              0.51                  3.10                  66.9
Back. sub.                   -0.28              0.00               0.37                  1.46                  20.0


It can be observed that the estimation of the position of the object in the x direction was relatively unaffected by these types and levels of noise (see columns labeled $\mu_{t_x}$ and $\sigma_{t_x}$). Similarly, the errors of the position estimate in the y direction were not affected by the Gaussian noise, but they significantly increased for the other types of synthetic noise (see columns labeled $\mu_{t_y}$ and $\sigma_{t_y}$). The results in the background subtraction case, on the other hand, were in every case at the same level as the noiseless case.

Table 3 contains the total number of pixels ($\tau$) and voxels ($\nu$) processed by the algorithm under each noise condition. This table indicates that BFs obtained by background subtraction having the level of artifacts shown in Fig. 13f require a slight amount of additional computation. BFs corrupted by higher levels of synthetic noise, on the other hand, require significantly higher amounts of computation (in particular for salt and pepper and structured noise). The additional computation is required because when the input images are corrupted by noise, more cycles have to be spent before the group of hypotheses that will ultimately become solutions can be identified. In other words, more cycles are spent refining hypotheses that are eventually discarded.

Table 3 Amount of computation needed to estimate the pose of a known object in a noisy image.

Noise                        $\tau$ ($\times 10^6$)   $\nu$ ($\times 10^6$)
No noise                     0.44                     3.63
$\mathcal{SP}(0.05)$         4.00                     24.66
$\mathcal{SP}(0.10)$         15.92                    94.71
$\mathcal{S}(40)$            2.41                     14.66
$\mathcal{S}(20)$            8.76                     49.52
$\mathcal{N}(0, 0.10^2)$     0.96                     7.98
$\mathcal{N}(0, 0.20^2)$     2.54                     20.00
Background subtraction       0.65                     5.06


8.2 Assessment of the performance on a larger dataset 

In contrast with the previous section, in this section we look at the statistical performance of the framework on a dataset containing 32 images (see examples in the first row of figures 16, 20 and 21). The image BFs (i.e., $B_f$ in (31)) were obtained using background subtraction, and the BFs for the shape priors (i.e., the $B_K$'s) were computed from a sample of training 3D shapes for each class. We split each class into subclasses by clustering the shapes in the class, and then compute a BF for each subclass using all the objects in the subclass (see details in [14]). We denote by $\mathbb{K}_{class}$ the set of subclasses of a class. For the classes 'cups,' 'bottles,' 'plates,' 'glasses' and 'mugs,' we defined 9, 3, 2, 1 and 16 subclasses, respectively. Fig. 14 shows 2D cuts of the 3D BFs obtained for some of these subclasses.

To define the hypothesis spaces we define the transformation

$$T_\psi(X) \triangleq T_{t_x t_y}\big(R_z(\phi)\, S_z(s_z)\, S_{xy}(s_{xy})\, X\big), \tag{94}$$

which depends on the vector of parameters $\psi = [t_x\ t_y\ \phi\ s_{xy}\ s_z]^T$ and combines the horizontal translation $T_{t_x t_y}$ in (91) with a rotation of $\phi$ degrees around the vertical z axis, $R_z(\phi)$, a scaling of the z axis by $s_z$%, $S_z(s_z)$, and a scaling of the x-y axes by $s_{xy}$%, $S_{xy}(s_{xy})$. Unless otherwise stated the parameters of this transformation are in the following sets: $t_x \in \{-3 : 0.5 : 3\}$, $t_y \in \{-9 : 1.5 : 9\}$, $\phi \in \{-80 : 20 : 80\}$, $s_{xy} \in \{-20 : 5 : 20\}$ and $s_z \in \{-20 : 5 : 20\}$.



Fig. 14 Vertical cuts through the 3D BFs corresponding to subclasses in $\mathbb{K}_{mugs}$ (a-d), $\mathbb{K}_{cups}$ (e-g), $\mathbb{K}_{glasses}$ (h), $\mathbb{K}_{bottles}$ (i-j), and $\mathbb{K}_{plates}$ (k). Colors indicate the probability that each point in the vertical plane would be inside an object of the subclass.

We then define the hypothesis spaces $\mathbb{H}_3 = \{(K_{object}, T_\theta) : \phi = s_{xy} = s_z = 0\}$ and $\mathbb{H}_4 = \{(K_{object}, T_\theta) : s_{xy} = s_z = 0\}$ for the case where the object is known ($K_{object}$ is then a "class" containing just this object), the hypothesis spaces $\mathbb{H}_5 = \{(K, T_\theta) : K \in \mathcal{K}_{class},\ \phi = 0\}$ and $\mathbb{H}_6 = \{(K, T_\theta) : K \in \mathcal{K}_{class}\}$ for the case where the object is not known a priori, but only its class is, and the hypothesis space $\mathbb{H}_7 = \{(K, T_\theta) : K \in \mathcal{K}_{AllClasses},\ \phi \in \{-60 : 20 : 60\}\}$ for the case where neither the object nor its class is known ($\mathcal{K}_{AllClasses} = \bigcup_{class} \mathcal{K}_{class}$). Note that when the object is known (i.e., for $\mathbb{H}_3$ or $\mathbb{H}_4$) there is no need to estimate $s_{xy}$ and $s_z$ (because the object dimensions are known). Similarly, $\phi$ does not need to be estimated when the object is known to belong to a rotationally symmetric class (e.g., bottles, cups, glasses or plates), only when it belongs to a non-symmetric class (e.g., mugs). For this reason the sets $\mathbb{H}_3$ or $\mathbb{H}_5$ are used in the first case and the sets $\mathbb{H}_4$ or $\mathbb{H}_6$ are used in the second.
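To make the size of these hypothesis spaces concrete, the following minimal sketch enumerates the parameter grids quoted above and builds two of the spaces for a known object. The helper names (`grid`, `hypothesis_space`) and the placeholder label `"known_object"` are illustrative assumptions, not part of the original framework.

```python
from itertools import product


def grid(start, step, stop):
    """Values start, start + step, ..., stop (inclusive), i.e. {start : step : stop}."""
    n = int(round((stop - start) / step))
    return [start + i * step for i in range(n + 1)]


# Parameter sets quoted in the text (translations in cm, rotation in degrees,
# scalings in percent).
T_X = grid(-3, 0.5, 3)
T_Y = grid(-9, 1.5, 9)
PHI = grid(-80, 20, 80)
S_XY = grid(-20, 5, 20)
S_Z = grid(-20, 5, 20)


def hypothesis_space(subclasses, phi=PHI, s_xy=S_XY, s_z=S_Z):
    """All pairs (subclass, (t_x, t_y, phi, s_xy, s_z)).

    Passing [0] for a parameter freezes it, as in the definitions of the
    spaces for known objects.
    """
    return [(k, pose)
            for k in subclasses
            for pose in product(T_X, T_Y, phi, s_xy, s_z)]


# Known, rotationally symmetric object: only the translation is free.
h3 = hypothesis_space(["known_object"], phi=[0], s_xy=[0], s_z=[0])
# Known, non-symmetric object: translation and rotation are free.
h4 = hypothesis_space(["known_object"], s_xy=[0], s_z=[0])
```

Under these grids a single known object yields 13 x 13 translations, and each additional free parameter multiplies the number of hypotheses by the size of its grid, which is why bounds that discard hypotheses cheaply matter.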

Some comments about the choice of the parameter ranges are in order. The ranges of $t_x$ and $t_y$ were restricted (with respect to those in Experiment 2) to save memory, since it was shown in Fig. 11d that hypotheses farther away from the ground truth are immediately discarded. Moreover, the distance between hypotheses was adjusted to be of the order that the framework can distinguish (from Table 2, 0.5 cm and 1.5 cm in the x and y directions, respectively). The range of $\phi$ was restricted in $\mathbb{H}_7$ to avoid ambiguities between mugs and glasses. These ambiguities arise when a mug is rotated in a way that hides its handle, so that it cannot be distinguished from a glass. They were avoided in order to separate classification failures due to problems with our method from those intrinsic to the problem formulation.

Pose estimation. Table 4 summarizes the pose estimation errors obtained on the hypothesis spaces defined before. As expected, the precision is in general reduced (i.e., the standard deviations increase) and more solutions are found when the number of degrees of freedom is increased. Note that even in the case where the class is unknown ($\mathbb{H}_7$), the pose parameters can be estimated accurately. The largest standard deviation is observed for the parameter $\phi$ of the rotation $R_z$ around the vertical axis, because this rotation affects only a small part of the object (the mug's handle), and because this part is highly variable and hence not encoded in the BFs as well as the main body of the mugs (see Fig. 14a-d). This problem would be solved with a larger training dataset of 3D shapes (currently containing between 6 and 36 objects per class) and more elaborate methods to construct BFs. This issue will be further discussed later.

Table 4 Pose estimation errors for hypothesis spaces with different numbers of degrees of freedom.

3.2
H5   -0.1   0.5   0.1   1.4   22
H6   -0.1   0.5   -2   0.4   0.2   0.2   1.5   41   7.8   2.9   422
H7   -0.1   0.3   -3   2.4   3.5   0.3   1.7   32   9.2   7.7   153



Classification. In the experiments corresponding to $\mathbb{H}_7$ the object classes were not known, and they were thus estimated in addition to the pose parameters. Since the proposed approach does not necessarily associate a single class to each testing image $f_k$ (because the corresponding set of solutions $\mathcal{A}_k$ might contain solutions of different classes), we report the performance of the framework with a slight modification of the traditional indicators.

Let $class(H)$ and $class(f_k)$ be the class of the hypothesis $H$ and the true class of the object in image $f_k$, respectively, and let $E_i = \{k : class(f_k) = i\}$ be the set of indices of the images of class $i$. An element $(i, j)$ in the confusion matrix $C_0$ (in Fig. 15) indicates the total normalized percentage of solutions of class $j$ obtained for all testing images of class $i$,

$$C_0(i, j) \triangleq \frac{100}{|E_i|} \sum_{k \in E_i} \frac{\left|\{H \in \mathcal{A}_k : class(H) = j\}\right|}{|\mathcal{A}_k|}. \qquad (95)$$

Note that if only one solution is found per experiment, this formula reduces to the standard confusion matrix.
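The modified confusion matrix can be sketched in a few lines. The version below assumes one particular reading of the normalization, namely that the solutions of each image are weighted by the inverse of the number of solutions found for that image and that each row is averaged over the images of the corresponding class; it also assumes that every image has at least one solution. The function name and the data layout are hypothetical.

```python
def confusion_matrix(true_classes, solution_classes, n_classes):
    """Confusion matrix for experiments that may return several solutions.

    true_classes[k]     : true class index of test image k.
    solution_classes[k] : class index of every solution found for image k
                          (assumed non-empty).
    Returns C, where C[i][j] is the average percentage of class-j solutions
    over the test images whose true class is i.
    """
    C = [[0.0] * n_classes for _ in range(n_classes)]
    images_per_class = [0] * n_classes
    for i, solutions in zip(true_classes, solution_classes):
        images_per_class[i] += 1
        for j in solutions:
            # Each image contributes a total weight of one, split evenly
            # among its solutions.
            C[i][j] += 1.0 / len(solutions)
    for i in range(n_classes):
        if images_per_class[i]:
            C[i] = [100.0 * c / images_per_class[i] for c in C[i]]
    return C


# Two images of class 0 (one of them with mixed solutions), one of class 1.
example = confusion_matrix([0, 0, 1], [[0, 0, 1, 1], [0], [1]], n_classes=2)
```

With a single solution per image every weight equals one, so each row is simply the percentage of images of that class assigned to each label, which matches the remark that the formula reduces to the standard confusion matrix.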

It is also of interest to know what the classification performance is when only the best solutions are considered. For this purpose we define the confusion matrix $C_\beta$ ($0 \le \beta \le 1$) as before, but considering only the solutions whose upper bound is greater than or equal to $\gamma_\beta$, where $\gamma_\beta = \underline{L} + \beta(\overline{L} - \underline{L})$ and $\underline{L}$ and $\overline{L}$ are the maximum lower and upper bounds, respectively, of any solution.

Fig. 15 Confusion matrices obtained in the classification for $\beta \in \{0, 0.5, 1\}$. See text for details.

Note that when $\beta = 0$ all the solutions are considered, and when $\beta = 1$ only the solution with the largest upper bound is considered. The confusion matrices $C_{0.5}$ and $C_1$ are also shown in Fig. 15.
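The selection of the "best" solutions by their upper bound can be illustrated as follows. This is only a sketch: the function name and the representation of each solution as a (lower bound, upper bound) pair are assumptions made for the example.

```python
def select_solutions(bounds, beta):
    """Indices of the solutions kept for a given beta in [0, 1].

    bounds : list of (lower, upper) pairs, one per solution.
    A solution is kept when its upper bound reaches the threshold
    gamma = max_lower + beta * (max_upper - max_lower).
    """
    max_lower = max(lower for lower, _ in bounds)
    max_upper = max(upper for _, upper in bounds)
    gamma = max_lower + beta * (max_upper - max_lower)
    return [idx for idx, (_, upper) in enumerate(bounds) if upper >= gamma]


# Three solutions whose bound intervals overlap.
bounds = [(1.0, 3.0), (2.0, 2.5), (0.5, 2.0)]
kept_all = select_solutions(bounds, 0.0)    # threshold = largest lower bound
kept_half = select_solutions(bounds, 0.5)
kept_best = select_solutions(bounds, 1.0)   # threshold = largest upper bound
```

Setting beta to zero keeps every hypothesis that could not be discarded, while beta equal to one keeps only the hypothesis with the largest upper bound.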



3D reconstruction. As explained earlier, two 3D reconstructions are obtained for each solution (selected in this case from the set $\mathbb{H}_7$). These reconstructions are given by the discrete shape and the semidiscrete shape obtained, respectively, while computing the lower and upper bounds for the solutions.

In order to quantify the quality of these reconstructions we computed the error in the reconstruction obtained for the best solution (the discrete and semidiscrete reconstructions are almost identical), by measuring its distance to the corresponding true shape $v_{true}$. This distance is defined as the normalized measure of the set where the two shapes differ, i.e.,

d{v, Vtrue) = d[v, Vtrue) = ^ ^ , (96) 

where v is the continuous shape produced from the reconstruction after this shape is translated to be optimally aligned with $v_{true}$. This alignment is performed to disregard the errors in the reconstruction resulting from errors in the pose, since those errors were already reported in Table 4. Using this metric we obtained a mean reconstruction error of 16.7%.



Fig. 16 shows the reconstructions v obtained for the best and worst solutions (as indicated by their upper bound) in five different experiments, one for each class considered. Note that in most cases the best and worst reconstructions are very similar. It can be seen that the 3D reconstructions look better from viewpoints that are closer to the original viewpoint (in the input image) than from viewpoints that are "orthogonal" to it (compare the 2nd and 4th rows with the 3rd and 5th rows, respectively). The explanation for this is that from viewpoints close to the original viewpoint we see the best parts of the reconstruction, i.e., those in which information from the shape prior and from the input image was used. In contrast, from viewpoints that are orthogonal to the original viewpoint, we see the worst parts of the reconstruction, where only the information from the shape prior could be used.

Fig. 16 3D reconstructions v obtained in five different experiments. For each input image (shown in the 1st row after the background was "subtracted"), two reconstructions were computed, and for each reconstruction, two views are shown (as in Fig. 12). The reconstructions correspond to the solutions with the highest (2nd and 3rd rows) and lowest (4th and 5th rows) upper bound. In the second reconstruction for the glass (4th column, 4th and 5th rows) the class 'glasses' was mistaken for the class 'mugs.'



The reconstructions obtained while computing the lower bounds are in general almost identical to the corresponding reconstructions obtained while computing the upper bounds. Fig. 20 (in the supplementary material) shows the lower-bound reconstructions corresponding to the reconstructions depicted in Fig. 16. Additional reconstructions are also shown in Fig. 21 (in the supplementary material).

In rays containing only points deemed likely to be 
Out of the reconstruction (according to the shape prior) 
but that nevertheless project to pixels that are likely 
to be Foreground (according to the input image), the 
framework is forced to make a compromise between the 
contradictory information in the input image and the 
shape prior and to add a small amount of mass in the 
inner side of ^. While this is perfectly correct from the 
optimization perspective, the added lumps of mass con- 
stitute artifacts in the reconstruction. These artifacts, 
however, are very easy to detect and remove and thus 
this was automatically done in all the reconstructions 
shown in this work. 



2D segmentation. Recall that a pair of segmentations is also obtained for each solution, along with a pair of bounds and reconstructions. These segmentations are given by the discrete shape and the semidiscrete shape obtained while computing the lower and upper bounds for the solutions, respectively. The segmentations corresponding to the reconstructions depicted in the 2nd and 3rd rows of Fig. 16 are shown in Fig. 17a. These segmentations were obtained for a value of $\lambda$ in (31), namely $\lambda_{opt}$, which was chosen to make the weights of the first and third terms of that expression equal, and which was found to minimize the pose estimation error. This value $\lambda_{opt}$ thus depends on the BFs $B_f$ and $B_K$ corresponding to the input image and the class priors, and it might be different in different experiments.

If, on the other hand, one is interested in "fixing" artifacts produced by the background subtraction process by considering prior 3D shape information, then the weight given to the shape prior term (i.e., $\lambda$) needs to be increased. Segmentations obtained with $\lambda = 2\lambda_{opt}$ are shown in Fig. 17b. Note how this larger $\lambda$ fixes the cup's "hole" in the leftmost column. The discrete and semidiscrete segmentations are in general almost identical (see Fig. 22 in the supplementary material). Additional segmentations are shown in Fig. 23 (in the supplementary material).



Fig. 17 Segmentations q corresponding to the best solutions depicted in Fig. 16, obtained with $\lambda = \lambda_{opt}$ as in Fig. 16 (a) and with $\lambda = 2\lambda_{opt}$ (b).



Comparison with an alternative approach. As mentioned earlier, we consider the work of Sandhu et al. [15] to be the closest to ours, even though in that work the problems of classification and 3D reconstruction are not addressed (but could be addressed with some major modifications to the framework). One of the main differences between our proposed approach and the approach in [15] is that the latter requires a good estimate of the pose to initialize the optimization, while our approach does not.

In order to illustrate this point we show in Fig. 18 the region of initial poses that lead to the correct solution being found by the approach in [15]. In other words, we repeated Experiment 2 (in §8.1) using the approach in [15] and using each hypothesis in $\mathbb{H}_2$ as an initial condition. We observed that only when the initial pose is close to the true pose does this approach converge to the right solution (otherwise it does not converge or converges to a different solution). While this could be solved by running that framework with different initial conditions (if the true solution is approximately known), the fact that each initial condition has to be "fully" processed significantly increases the cost of the approach. In contrast, in our approach only the hypotheses close to the true hypothesis have to be fully processed. This experiment is described in detail in §12 in the supplementary material.

Fig. 18 Convergence of the approach in [15] to the correct solution. When the approach in [15] is initialized with a hypothesis inside the area in red, this approach finds the correct solution. Otherwise it finds a different solution or does not converge at all.

This concludes the presentation of the experiments. 
In the next section we present our conclusions and pos- 
sible directions for future work. 



9 Conclusions 

This article introduced an inference framework to si- 
multaneously tackle the problems of 3D reconstruction, 
pose estimation and object classification from a single 
input image, by considering shape cues only and by 
relying on prior 3D knowledge about the shape of dif- 
ferent object classes. The proposed inference framework 
is based on an H&B algorithm, which greatly reduces 
the amount of computation required while still being 
guaranteed to find the optimal solutions. In order to 
instantiate the H&B paradigm for the current problem, 
we extended the theory of shapes and shape priors presented
in [14] to handle projections of shapes.

While the proposed approach already provides state- 
of-the-art results, it still can be improved and extended 
in several directions. For example, it could be extended 
to exploit the redundancy among hypotheses, by group- 
ing them, computing bounds for these groups, and then 
discarding whole groups of hypotheses together (à la
Branch and Bound). Other directions include considering
different types of input images (e.g., depth maps)
or multiple cameras or videos.

10 Supplementary material: Computing summaries in constant time

One of the main properties of summaries is that, for certain kinds of sets, they can be computed in constant time regardless of the number of elements (e.g., pixels) in these sets. Next we show how to compute the mean-summary $Y_{B,\Phi}$ and the m-summaries $Y^k_{B,\Phi}$ of a BF $B$ for the set $\Phi \subset \Omega$ (defined below). For simplicity we assume that $\Omega \subset \mathbb{R}^2$, but the results presented here immediately generalize to higher dimensions. We also assume that $\Pi(\Omega) = \{\Omega_{1,1}, \Omega_{1,2}, \dots, \Omega_{n,n}\}$ is a uniform partition of $\Omega$ organized in rows and columns, where each partition element $\Omega_{i,j}$ (in the $i$-th row and the $j$-th column) is a square of area $|\Omega_{i,j}| = u_\Omega$. We assume that $B$, defined by its logit function $\delta_B(x)$ ($x \in \Omega$), was produced from a discrete BF $\hat{B}$ in $\Pi(\Omega)$ (as described in the corresponding definition), and therefore $\delta_B(x) = \delta_{\hat{B}}(i, j)$ $\forall x \in \Omega_{i,j}$.

Computing mean-summaries in a box. Let us assume for the time being that $\Phi$ is an axis-aligned rectangular region containing only whole pixels (i.e., not parts of pixels). That is,

$$\Phi = \bigcup_{\substack{i_L \le i \le i_U \\ j_L \le j \le j_U}} \Omega_{i,j}. \qquad (97)$$

These special regions will be referred to as boxes. In order to compute the mean-summary $Y_{B,\Phi}$ in the box $\Phi$, note that from (32),

$$Y_{B,\Phi} = \int_\Phi \delta_B(x)\, dx = u_\Omega \sum_{\substack{i_L \le i \le i_U \\ j_L \le j \le j_U}} \delta_{\hat{B}}(i, j). \qquad (98)$$

The sum on the rhs of (98) can be computed in constant time by relying on integral images [19], an image representation proposed precisely to compute sums in rectangular domains in constant time. To accomplish this, integral images precompute a matrix where each pixel stores the cumulative sum of the values in pixels with lower indices. The sum in (98) is then computed as a signed sum of four of these precomputed cumulative sums.
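The integral-image construction can be sketched in plain Python as follows. The function names, the zero-padded layout of the cumulative table and the `pixel_area` argument are choices made for this illustration only.

```python
def integral_image(values):
    """Cumulative table S with S[i][j] = sum of values[a][b] for a < i, b < j.

    The extra first row and column of zeros avoid special cases at the border.
    """
    rows, cols = len(values), len(values[0])
    S = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            S[i + 1][j + 1] = values[i][j] + S[i][j + 1] + S[i + 1][j] - S[i][j]
    return S


def box_sum(S, i_lo, i_hi, j_lo, j_hi):
    """Sum over rows i_lo..i_hi and columns j_lo..j_hi (inclusive), in O(1)."""
    return (S[i_hi + 1][j_hi + 1] - S[i_lo][j_hi + 1]
            - S[i_hi + 1][j_lo] + S[i_lo][j_lo])


def mean_summary_box(S, box, pixel_area=1.0):
    """Integral of a piecewise-constant logit function over a box of whole pixels."""
    return pixel_area * box_sum(S, *box)


logits = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
S = integral_image(logits)
lower_left = box_sum(S, 1, 2, 0, 1)   # 4 + 5 + 7 + 8
```

The table is built once in time proportional to the number of pixels; afterwards every box query costs four lookups, independently of the size of the box.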

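Computing m-summaries in a box. The formula to compute the m-summary $Y^k_{B,\Phi}$ in the box $\Phi$ is similarly derived. From (48), and since $\delta_B$ is constant inside each partition element, it holds for $k = -m, \dots, m$ that

$$Y^k_{B,\Phi} = \left|\left\{x \in \Phi : \delta_B(x) < \frac{k\,\delta_{max}}{m}\right\}\right| = u_\Omega \left|\left\{(i, j) : i_L \le i \le i_U,\ j_L \le j \le j_U,\ \delta_{\hat{B}}(i, j) < \frac{k\,\delta_{max}}{m}\right\}\right|. \qquad (99)$$

Let us now define the matrices $I_k$ ($k = -m, \dots, m$) as

$$I_k(i, j) \triangleq \begin{cases} 1, & \text{if } \delta_{\hat{B}}(i, j) < k\,\delta_{max}/m, \\ 0, & \text{otherwise.} \end{cases} \qquad (100)$$

Using this definition, (99) can be rewritten as

$$Y^k_{B,\Phi} = u_\Omega \sum_{\substack{i_L \le i \le i_U \\ j_L \le j \le j_U}} I_k(i, j), \qquad (101)$$

which as before can be computed in O(1) using integral images.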
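A minimal sketch of this construction is given below. It assumes that the k-th element of the m-summary is the area of the pixels whose logit value lies strictly below k times the maximum logit value divided by m, and it precomputes one cumulative table per value of k; all names are hypothetical.

```python
def cumulative_table(values):
    """Zero-padded cumulative sums: S[i][j] = sum of values[a][b], a < i, b < j."""
    rows, cols = len(values), len(values[0])
    S = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            S[i + 1][j + 1] = values[i][j] + S[i][j + 1] + S[i + 1][j] - S[i][j]
    return S


def indicator_tables(logit, delta_max, m):
    """One cumulative table per k = -m..m for the indicator logit < k*delta_max/m."""
    tables = {}
    for k in range(-m, m + 1):
        threshold = k * delta_max / m
        indicator = [[1 if d < threshold else 0 for d in row] for row in logit]
        tables[k] = cumulative_table(indicator)
    return tables


def m_summary_box(tables, box, pixel_area=1.0):
    """m-summary of a box (i_lo, i_hi, j_lo, j_hi): four lookups per value of k."""
    i_lo, i_hi, j_lo, j_hi = box
    summary = {}
    for k, S in tables.items():
        count = (S[i_hi + 1][j_hi + 1] - S[i_lo][j_hi + 1]
                 - S[i_hi + 1][j_lo] + S[i_lo][j_lo])
        summary[k] = pixel_area * count
    return summary


logit = [[-1.0, -0.5],
         [0.0, 1.0]]
tables = indicator_tables(logit, delta_max=1.0, m=2)
whole = m_summary_box(tables, (0, 1, 0, 1))
top_row = m_summary_box(tables, (0, 0, 0, 1))
```

The cost of a query grows with the number of levels 2m + 1 but not with the number of pixels in the box.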

Computing mean-summaries in a convex set. In general we are interested in cases in which $\Phi$ is not axis-aligned or even rectangular; we only require $\Phi$ to be convex. In this case we will not compute $Y_{B,\Phi}$ exactly, but rather we will find a lower bound for it. Note that by doing this we can still obtain valid lower bounds for the evidence using (38).

Toward this end we partition $\Phi$ as $\{\Phi_1, \dots, \Phi_{n_\Phi}, v_1, \dots, v_{n_v}\}$, where each $\Phi_i$ is a box (as defined in (97)) and each $v_i$ is a set whose bounding box $r(v_i)$ is disjoint from the other bounding boxes and the $\Phi_i$'s (see Fig. 19). Specifically, $r(v_i)$ is defined as the smallest box containing $v_i$. To obtain this partition we find the largest box inside $\Phi$ and label it $\Phi_1$. Then we "cut" $\Phi$ with the lines determined by the sides of $\Phi_1$, yielding $v_1, \dots, v_s$ (see Fig. 19b). Next, the largest $v_i$, say $v_1$, is selected and the largest box inside it is found and labeled $\Phi_2$. And again, $v_1$ is cut with the lines determined by the sides of $\Phi_2$ (see Fig. 19c). This process is repeated a number of times, relabeling the $v_i$'s at each step, until the desired summary precision is met.




Fig. 19 Three partitions of the set $\Phi$ used to compute the bounds for the mean-summary and the m-summaries of $\Phi$. (a) The original set. (b and c) Partitions after one and two iterations, respectively.

For each partition, obtained at each step, it holds that

$$Y_{B,\Phi} = \sum_{i=1}^{n_\Phi} Y_{B,\Phi_i} + \sum_{i=1}^{n_v} Y_{B,v_i}, \qquad (102)$$

where the $Y_{B,\Phi_i}$'s are computed exactly using (98). To bound the $Y_{B,v_i}$'s, we observe that

$$Y_{B,v_i} = \int_{v_i} \delta_B(x)\, dx \ge -\delta_{max} |v_i|, \qquad (103)$$

and that

$$Y_{B,r(v_i)} = \int_{v_i} \delta_B(x)\, dx + \int_{r(v_i) \setminus v_i} \delta_B(x)\, dx, \qquad (104)$$

and hence

$$Y_{B,v_i} \ge Y_{B,r(v_i)} - \delta_{max} \left(|r(v_i)| - |v_i|\right). \qquad (105)$$

Note that the summary in the first term on the rhs of (105) can be computed exactly using (98) because $r(v_i)$ is a box.

Substituting (103) and (105) into (102) yields the final lower bound for the mean-summary,

$$Y_{B,\Phi} \ge \sum_{i=1}^{n_\Phi} Y_{B,\Phi_i} + \sum_{i=1}^{n_v} \max\left\{Y_{B,r(v_i)} - \delta_{max}\left(|r(v_i)| - |v_i|\right),\ -\delta_{max} |v_i|\right\}. \qquad (106)$$

Clearly better bounds for $Y_{B,\Phi}$ are obtained in finer partitions of $\Phi$ (greater $n_\Phi$) at a greater computational cost. We found that for our purposes $n_\Phi = 10$ in 2D, and $n_\Phi = 30$ in 3D, provide a good compromise.
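The lower bound for the mean-summary can be assembled from quantities that are all available in constant time. The sketch below assumes that the logit function is bounded in absolute value by `delta_max`, so that the integral over a non-box piece is at least minus `delta_max` times its area, and at least the integral over its bounding box minus `delta_max` times the area of the bounding box that lies outside the piece. Function and argument names are hypothetical.

```python
def piece_lower_bound(y_bbox, bbox_area, piece_area, delta_max):
    """Lower bound for the mean-summary of a non-box piece.

    y_bbox     : exact mean-summary of the bounding box of the piece.
    bbox_area  : area of the bounding box.
    piece_area : area of the piece itself (at most bbox_area).
    The tighter of the two available bounds is returned.
    """
    from_bbox = y_bbox - delta_max * (bbox_area - piece_area)
    trivial = -delta_max * piece_area
    return max(from_bbox, trivial)


def mean_summary_lower_bound(box_summaries, pieces, delta_max):
    """Exact box terms plus one bounded term per remaining piece.

    box_summaries : exact mean-summaries of the boxes in the partition.
    pieces        : iterable of (y_bbox, bbox_area, piece_area) triples.
    """
    return sum(box_summaries) + sum(
        piece_lower_bound(y, bbox_area, piece_area, delta_max)
        for y, bbox_area, piece_area in pieces)


# Two boxes and two residual pieces; the second piece falls back to the
# trivial bound because its bounding box carries little information.
bound = mean_summary_lower_bound(
    [5.0, 1.0], [(3.0, 4.0, 2.0), (-4.0, 4.0, 2.0)], delta_max=1.0)
```

Refining the partition moves area from the residual pieces into boxes, so the bounded terms shrink and the bound tightens, at the cost of more box queries.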

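Computing m-summaries in a convex set. For a convex set $\Phi$ we are going to compute an upper bound $\overline{Y}^k_{B,\Phi}$ for $Y^k_{B,\Phi}$, by partitioning $\Phi$ into $\{\Phi_1, \dots, \Phi_{n_\Phi}, v_1, \dots, v_{n_v}\}$ as before. Using this bound we will in turn obtain a valid upper bound for the evidence.

Given a partition of $\Phi$ as described above, it follows from (48) that

$$Y^k_{B,\Phi} = \sum_{i=1}^{n_\Phi} Y^k_{B,\Phi_i} + \sum_{i=1}^{n_v} Y^k_{B,v_i}. \qquad (107)$$

The m-summaries $Y^k_{B,\Phi_i}$ in (107) can be computed exactly using (101). The m-summaries $Y^k_{B,v_i}$, on the other hand, can only be bounded. Below we derive an upper bound $\overline{Y}^k_{B,v_i}$ for them. Substituting this bound in (107) we will obtain the upper bound $\overline{Y}^k_{B,\Phi}$ for $Y^k_{B,\Phi}$.

Recall that our goal is to substitute these summaries in (50)-(51) to find an upper bound for the lhs of (49). Since we do not know which of the values of $\delta_B$ inside $r(v_i)$ are actually inside $v_i$, we need to consider the worst case. This worst case is when the greatest values of $\delta_B$ inside $r(v_i)$ are actually inside $v_i$. In other words, we need to "fill" $v_i$ with the greatest values of $\delta_B$ in $r(v_i)$.

In order to simplify the derivation of the bound $\overline{Y}^k_{B,v_i}$, we define the quantities

$$M^k_{r(v_i)} \triangleq \begin{cases} |r(v_i)| - Y^k_{B,r(v_i)}, & \text{if } k = m, \\ Y^{k+1}_{B,r(v_i)} - Y^k_{B,r(v_i)}, & \text{if } k < m. \end{cases} \qquad (108)$$

Each of these quantities, e.g., $M^k_{r(v_i)}$, indicates the measure of a set of the kind $\{x \in r(v_i) : k\,\delta_{max}/m \le \delta_B(x) < (k+1)\,\delta_{max}/m\}$. Similarly we define the corresponding quantities for the upper bound of the summary, $\overline{Y}^k_{B,v_i}$, that we want to compute,

$$M^k_{v_i} \triangleq \begin{cases} |v_i| - \overline{Y}^k_{B,v_i}, & \text{if } k = m, \\ \overline{Y}^{k+1}_{B,v_i} - \overline{Y}^k_{B,v_i}, & \text{if } k < m. \end{cases} \qquad (109)$$

Note that since $v_i \subset r(v_i)$, the quantity $M^k_{v_i}$ is bounded above by the quantity $M^k_{r(v_i)}$. Moreover, this quantity $M^k_{v_i}$ is also bounded above by the remaining volume $V^k_{v_i}$ in $v_i$ (i.e., the volume not yet "filled"), which can be written as

$$V^k_{v_i} \triangleq |v_i| - \sum_{j=k+1}^{m} M^j_{v_i}. \qquad (110)$$

Therefore we can compute $M^k_{v_i}$ from $M^k_{r(v_i)}$ as

$$M^k_{v_i} = \min\left\{M^k_{r(v_i)},\ V^k_{v_i}\right\}. \qquad (111)$$

Since it can be verified that $V^k_{v_i}$ satisfies

$$V^k_{v_i} = \begin{cases} |v_i|, & \text{if } k = m, \\ \overline{Y}^{k+1}_{B,v_i}, & \text{if } k < m, \end{cases} \qquad (112)$$

it follows from (111) that the bound for the summary $\overline{Y}^k_{B,v_i}$ can be computed with the following recursion (in decreasing order of $k$):

$$\overline{Y}^m_{B,v_i} = |v_i| - \min\left\{|r(v_i)| - Y^m_{B,r(v_i)},\ |v_i|\right\}, \qquad (113)$$

$$\overline{Y}^k_{B,v_i} = \overline{Y}^{k+1}_{B,v_i} - \min\left\{Y^{k+1}_{B,r(v_i)} - Y^k_{B,r(v_i)},\ \overline{Y}^{k+1}_{B,v_i}\right\} \quad (-m \le k < m). \qquad (114)$$

Thus the final upper bound for the m-summary is given by

$$\overline{Y}^k_{B,\Phi} = \sum_{i=1}^{n_\Phi} Y^k_{B,\Phi_i} + \sum_{i=1}^{n_v} \overline{Y}^k_{B,v_i}. \qquad (115)$$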
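The recursion for a single piece can be sketched as below. The sketch assumes that the k-th element of an m-summary is the measure of the points whose logit value is below k times the maximum logit value divided by m, and it implements the "fill the piece with the greatest values first" argument; the function name and the dictionary representation of a summary are illustrative choices, and the returned values are the bound in the sense used in the text.

```python
def m_summary_piece_bound(y_bbox, bbox_area, piece_area, m):
    """Bound for the m-summary of a piece from the m-summary of its bounding box.

    y_bbox[k] (k = -m..m) : measure of the points of the bounding box whose
                            logit value is below level k.
    The piece is filled greedily with the largest logit values available in
    the bounding box, from level m downwards.
    """
    # Level m: the mass at the top level that fits inside the piece.
    bound = {m: piece_area - min(bbox_area - y_bbox[m], piece_area)}
    for k in range(m - 1, -m - 1, -1):
        mass_in_bin = y_bbox[k + 1] - y_bbox[k]   # bounding-box mass between levels k and k+1
        remaining = bound[k + 1]                  # part of the piece not yet filled
        bound[k] = remaining - min(mass_in_bin, remaining)
    return bound


# Bounding box of area 4: one unit at the top level, two units in the next
# bin and one unit in the lowest bin. The piece has area 2.
y_bbox = {-1: 0.0, 0: 1.0, 1: 3.0}
piece = m_summary_piece_bound(y_bbox, bbox_area=4.0, piece_area=2.0, m=1)
```

In this example the piece absorbs the unit of top-level mass and one of the two units of the next bin, so one unit of the piece lies below the top level and none lies below level zero.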
11 Supplementary material: experimental results

This section contains additional results that, due to space limitations, could not be included in the main text.




Fig. 20 3D reconstructions v obtained while computing the lower bound. The input image used in each experiment is shown in the 1st row after the background was "subtracted". The colors in these images were left for clarity only; our framework does not consider these colors, only the foreground probability at each pixel. For each input image two reconstructions were computed, and for each reconstruction two views are shown (as in Fig. 12). The reconstructions correspond to the solutions with the highest (2nd and 3rd rows) and lowest (4th and 5th rows) upper bound. In the second reconstruction for the glass (4th column, 4th and 5th rows) the class 'glasses' was mistaken for the class 'mugs.'



12 Notes on the comparison with [15]

As mentioned earlier, we consider the approach in [15] to be the closest to ours. That approach, however, is markedly different from ours and we had to make several adaptations to be able to compare it with ours. Some of these adaptations were necessary because the source code for the approach in [15] was not available, but only an executable program was. In this section we describe these adaptations.

The experiment on which we compare the two methods consists of finding the translation of a bottle in space given an image of the bottle. We use the same image of the bottle as input in both frameworks. Our framework, however, also receives an image of the background, which it uses to compute the foreground probability image (FPI), while [15] normally works directly with an RGB image. Thus, to make the comparison fair, we create an RGB image from the FPI by defining each channel of the RGB image to be equal to the FPI. This image is the input provided to [15].

Another input required by both methods is the camera matrix that maps points in 3D space to the camera retina. Our method takes as input a general camera matrix $M_g$, which we obtain using standard calibration methods and a grid of points in known 3D positions. This matrix can be written as $M_g = K_g \Pi_0 T$, where $K_g$ is a 3 x 3 calibration matrix, $\Pi_0$ is a 3 x 4 projection matrix, and $T$ is a 4 x 4 Euclidean transformation. The framework in [15], however, relies on a simplified form of the camera matrix, $M_s$, that considers the focal length to be the only calibration parameter. While other intrinsic camera parameters might be available to the user, the program implementing the method in [15] does not consider these parameters. This matrix can be written as $M_s = K_s \Pi_0 T$, where $K_s$ is a simplified 3 x 3 calibration matrix (only depending on the focal length), and $\Pi_0$ and $T$ are as before. Therefore, to make the comparison fair, we pre-transform the input image passed to the method in [15] by $K_s K_g^{-1}$, so that both methods use effectively the same camera matrix ($M_g$).
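The effect of this pre-transformation on pixel coordinates can be checked with a small sketch. It assumes an upper-triangular general calibration matrix and a simplified one of the form diag(f, f, 1); the numerical values, the helper names and the choice of working on individual pixel coordinates (rather than resampling a whole image) are assumptions made for illustration.

```python
def matmul3(a, b):
    """Product of two 3 x 3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def invert_calibration(k):
    """Closed-form inverse of [[fx, skew, cx], [0, fy, cy], [0, 0, 1]]."""
    fx, skew, cx = k[0]
    fy, cy = k[1][1], k[1][2]
    return [
        [1.0 / fx, -skew / (fx * fy), (skew * cy - cx * fy) / (fx * fy)],
        [0.0, 1.0 / fy, -cy / fy],
        [0.0, 0.0, 1.0],
    ]


def pretransform(k_simplified, k_general):
    """Homography that maps pixels of the general camera to the simplified one."""
    return matmul3(k_simplified, invert_calibration(k_general))


def apply_homography(h, pixel):
    """Apply a 3 x 3 homography to a pixel (u, v) in inhomogeneous coordinates."""
    u, v = pixel
    x, y, w = (h[r][0] * u + h[r][1] * v + h[r][2] for r in range(3))
    return (x / w, y / w)


# Hypothetical calibration matrices (focal lengths and principal point in pixels).
K_G = [[800.0, 0.0, 320.0], [0.0, 780.0, 240.0], [0.0, 0.0, 1.0]]
K_S = [[800.0, 0.0, 0.0], [0.0, 800.0, 0.0], [0.0, 0.0, 1.0]]
H = pretransform(K_S, K_G)

# A ray with normalized coordinates (0.1, -0.2) is seen at pixel (400, 84) by
# the general camera and should land at f * (0.1, -0.2) after the transform.
u_s, v_s = apply_homography(H, (400.0, 84.0))
```

Because the projection and the Euclidean transformation are shared by both camera models, warping the image with this homography is equivalent to replacing the general calibration matrix by the simplified one.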

Another adaptation was necessary because the framework in [15] returns a transformation up to a change of scale. In other words, the framework in [15] does not estimate the distance from the camera to the object and the object's actual size, while our approach does. Thus, we use the actual height of the bottle to correct the scale of the bottle and its position on the ground plane.



Fig. 22 Segmentations q corresponding to the best solutions depicted in the 2nd and 3rd rows of Fig. 20, obtained using $\lambda = \lambda_{opt}$ as in Fig. 20 (a) and $\lambda = 2\lambda_{opt}$ (b).

Fig. 21 Additional examples of 3D reconstructions. (1st row) Input image used in each experiment after the background was "subtracted". The colors were left only for clarity. Our framework does not consider these colors, only the foreground probability at each pixel. (2nd and 3rd rows) Two orthogonal views of the lower reconstruction v obtained for the best solution. (4th and 5th rows) Two orthogonal views of the lower reconstruction v obtained for the worst solution. (6th and 7th rows) Two orthogonal views of the upper reconstruction v obtained for the best solution. (8th and 9th rows) Two orthogonal views of the upper reconstruction v obtained for the worst solution.

Fig. 23 Segmentations q (in (a) and (b)) and q (in (c) and (d)) corresponding to the best solutions, obtained using $\lambda = \lambda_{opt}$ ((a) and (c)) and $\lambda = 2\lambda_{opt}$ ((b) and (d)).



